1. 15 Aug, 2012 35 commits
    • Mel Gorman's avatar
      mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · 4c9682c5
      Mel Gorman authored
      commit d833352a upstream.
      
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably was introduced
      in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
      look like this;
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately it appears to be also insufficient. Consider the following
      situation
      
      Process A					Process B
      ---------					---------
      hugetlb_fault					shmdt
        						LockWrite(mmap_sem)
          						  do_munmap
      						    unmap_region
      						      unmap_vmas
      						        unmap_single_vma
      						          unmap_hugepage_range
            						            Lock(i_mmap_mutex)
      							    Lock(mm->page_table_lock)
      							    huge_pmd_unshare/unmap tables <--- (1)
      							    Unlock(mm->page_table_lock)
            						            Unlock(i_mmap_mutex)
        huge_pte_alloc				      ...
          Lock(i_mmap_mutex)				      ...
          vma_prio_walk, find svma, spte		      ...
          Lock(mm->page_table_lock)			      ...
          share spte					      ...
          Unlock(mm->page_table_lock)			      ...
          Unlock(i_mmap_mutex)			      ...
        hugetlb_no_page									  <--- (2)
      						      free_pgtables
      						        unlink_file_vma
      							hugetlb_free_pgd_range
      						    remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vunerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
      ==== CUT HERE ====
      
      static size_t huge_page_size = (2UL << 20);
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could itterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork:");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      4c9682c5
    • Borislav Petkov's avatar
      x86, microcode: Sanitize per-cpu microcode reloading interface · bb014c40
      Borislav Petkov authored
      commit c9fc3f77 upstream.
      
      Microcode reloading in a per-core manner is a very bad idea for both
      major x86 vendors. And the thing is, we have such interface with which
      we can end up with different microcode versions applied on different
      cores of an otherwise homogeneous wrt (family,model,stepping) system.
      
      So turn off the possibility of doing that per core and allow it only
      system-wide.
      
      This is a minimal fix which we'd like to see in stable too thus the
      more-or-less arbitrary decision to allow system-wide reloading only on
      the BSP:
      
      $ echo 1 > /sys/devices/system/cpu/cpu0/microcode/reload
      ...
      
      and disable the interface on the other cores:
      
      $ echo 1 > /sys/devices/system/cpu/cpu23/microcode/reload
      -bash: echo: write error: Invalid argument
      
      Also, allowing the reload only from one CPU (the BSP in
      that case) doesn't allow the reload procedure to degenerate
      into an O(n^2) deal when triggering reloads from all
      /sys/devices/system/cpu/cpuX/microcode/reload sysfs nodes
      simultaneously.
      
      A more generic fix will follow.
      Signed-off-by: default avatarBorislav Petkov <borislav.petkov@amd.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1340280437-7718-2-git-send-email-bp@amd64.orgSigned-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb014c40
    • Shuah Khan's avatar
      x86, microcode: microcode_core.c simple_strtoul cleanup · d426c789
      Shuah Khan authored
      commit e826abd5 upstream.
      
      Change reload_for_cpu() in kernel/microcode_core.c to call kstrtoul()
      instead of calling obsoleted simple_strtoul().
      Signed-off-by: default avatarShuah Khan <shuahkhan@gmail.com>
      Reviewed-by: default avatarBorislav Petkov <bp@alien8.de>
      Link: http://lkml.kernel.org/r/1336324264.2897.9.camel@lorien2Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d426c789
    • H. Peter Anvin's avatar
      random: mix in architectural randomness in extract_buf() · 9f608240
      H. Peter Anvin authored
      commit d2e7c96a upstream.
      
      Mix in any architectural randomness in extract_buf() instead of
      xfer_secondary_buf().  This allows us to mix in more architectural
      randomness, and it also makes xfer_secondary_buf() faster, moving a
      tiny bit of additional CPU overhead to process which is extracting the
      randomness.
      
      [ Commit description modified by tytso to remove an extended
        advertisement for the RDRAND instruction. ]
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: DJ Johnston <dj.johnston@intel.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f608240
    • Tony Luck's avatar
      dmi: Feed DMI table to /dev/random driver · 4f4cb6f7
      Tony Luck authored
      commit d114a333 upstream.
      
      Send the entire DMI (SMBIOS) table to the /dev/random driver to
      help seed its pools.
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f4cb6f7
    • Tony Luck's avatar
      random: Add comment to random_initialize() · dbbdd2bb
      Tony Luck authored
      commit cbc96b75 upstream.
      
      Many platforms have per-machine instance data (serial numbers,
      asset tags, etc.) squirreled away in areas that are accessed
      during early system bringup. Mixing this data into the random
      pools has a very high value in providing better random data,
      so we should allow (and even encourage) architecture code to
      call add_device_randomness() from the setup_arch() paths.
      
      However, this limits our options for internal structure of
      the random driver since random_initialize() is not called
      until long after setup_arch().
      
      Add a big fat comment to rand_initialize() spelling out
      this requirement.
      Suggested-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dbbdd2bb
    • Theodore Ts'o's avatar
      random: remove rand_initialize_irq() · b6b847a9
      Theodore Ts'o authored
      commit c5857ccf upstream.
      
      With the new interrupt sampling system, we are no longer using the
      timer_rand_state structure in the irq descriptor, so we can stop
      initializing it now.
      
      [ Merged in fixes from Sedat to find some last missing references to
        rand_initialize_irq() ]
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b6b847a9
    • Mark Brown's avatar
      mfd: wm831x: Feed the device UUID into device_add_randomness() · f99ef862
      Mark Brown authored
      commit 27130f0c upstream.
      
      wm831x devices contain a unique ID value. Feed this into the newly added
      device_add_randomness() to add some per device seed data to the pool.
      Signed-off-by: default avatarMark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f99ef862
    • Mark Brown's avatar
      rtc: wm831x: Feed the write counter into device_add_randomness() · 2fcadd93
      Mark Brown authored
      commit 9dccf55f upstream.
      
      The tamper evident features of the RTC include the "write counter" which
      is a pseudo-random number regenerated whenever we set the RTC. Since this
      value is unpredictable it should provide some useful seeding to the random
      number generator.
      
      Only do this on boot since the goal is to seed the pool rather than add
      useful entropy.
      Signed-off-by: default avatarMark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2fcadd93
    • Theodore Ts'o's avatar
      MAINTAINERS: Theodore Ts'o is taking over the random driver · 07895209
      Theodore Ts'o authored
      commit 330e0a01 upstream.
      
      Matt Mackall stepped down as the /dev/random driver maintainer last
      year, so Theodore Ts'o is taking back the /dev/random driver.
      
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07895209
    • Theodore Ts'o's avatar
    • Theodore Ts'o's avatar
      random: add new get_random_bytes_arch() function · efe6c422
      Theodore Ts'o authored
      commit c2557a30 upstream.
      
      Create a new function, get_random_bytes_arch() which will use the
      architecture-specific hardware random number generator if it is
      present.  Change get_random_bytes() to not use the HW RNG, even if it
      is avaiable.
      
      The reason for this is that the hw random number generator is fast (if
      it is present), but it requires that we trust the hardware
      manufacturer to have not put in a back door.  (For example, an
      increasing counter encrypted by an AES key known to the NSA.)
      
      It's unlikely that Intel (for example) was paid off by the US
      Government to do this, but it's impossible for them to prove otherwise
        --- especially since Bull Mountain is documented to use AES as a
      whitener.  Hence, the output of an evil, trojan-horse version of
      RDRAND is statistically indistinguishable from an RDRAND implemented
      to the specifications claimed by Intel.  Short of using a tunnelling
      electronic microscope to reverse engineer an Ivy Bridge chip and
      disassembling and analyzing the CPU microcode, there's no way for us
      to tell for sure.
      
      Since users of get_random_bytes() in the Linux kernel need to be able
      to support hardware systems where the HW RNG is not present, most
      time-sensitive users of this interface have already created their own
      cryptographic RNG interface which uses get_random_bytes() as a seed.
      So it's much better to use the HW RNG to improve the existing random
      number generator, by mixing in any entropy returned by the HW RNG into
      /dev/random's entropy pool, but to always _use_ /dev/random's entropy
      pool.
      
      This way we get almost of the benefits of the HW RNG without any
      potential liabilities.  The only benefits we forgo is the
      speed/performance enhancements --- and generic kernel code can't
      depend on depend on get_random_bytes() having the speed of a HW RNG
      anyway.
      
      For those places that really want access to the arch-specific HW RNG,
      if it is available, we provide get_random_bytes_arch().
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      efe6c422
    • Theodore Ts'o's avatar
      random: use the arch-specific rng in xfer_secondary_pool · b2b6f120
      Theodore Ts'o authored
      commit e6d4947b upstream.
      
      If the CPU supports a hardware random number generator, use it in
      xfer_secondary_pool(), where it will significantly improve things and
      where we can afford it.
      
      Also, remove the use of the arch-specific rng in
      add_timer_randomness(), since the call is significantly slower than
      get_cycles(), and we're much better off using it in
      xfer_secondary_pool() anyway.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2b6f120
    • Theodore Ts'o's avatar
      net: feed /dev/random with the MAC address when registering a device · 118f98c7
      Theodore Ts'o authored
      commit 7bf23575 upstream.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      118f98c7
    • Theodore Ts'o's avatar
      usb: feed USB device information to the /dev/random driver · 52d1114f
      Theodore Ts'o authored
      commit b04b3156 upstream.
      
      Send the USB device's serial, product, and manufacturer strings to the
      /dev/random driver to help seed its pools.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarGreg KH <greg@kroah.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52d1114f
    • Linus Torvalds's avatar
      random: create add_device_randomness() interface · 3e035335
      Linus Torvalds authored
      commit a2080a67 upstream.
      
      Add a new interface, add_device_randomness() for adding data to the
      random pool that is likely to differ between two devices (or possibly
      even per boot).  This would be things like MAC addresses or serial
      numbers, or the read-out of the RTC. This does *not* add any actual
      entropy to the pool, but it initializes the pool to different values
      for devices that might otherwise be identical and have very little
      entropy available to them (particularly common in the embedded world).
      
      [ Modified by tytso to mix in a timestamp, since there may be some
        variability caused by the time needed to detect/configure the hardware
        in question. ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e035335
    • Theodore Ts'o's avatar
      random: use lockless techniques in the interrupt path · ebb6006e
      Theodore Ts'o authored
      commit 902c098a upstream.
      
      The real-time Linux folks don't like add_interrupt_randomness() taking
      a spinlock since it is called in the low-level interrupt routine.
      This also allows us to reduce the overhead in the fast path, for the
      random driver, which is the interrupt collection path.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ebb6006e
    • Theodore Ts'o's avatar
      random: make 'add_interrupt_randomness()' do something sane · aa88dea2
      Theodore Ts'o authored
      commit 775f4b29 upstream.
      
      We've been moving away from add_interrupt_randomness() for various
      reasons: it's too expensive to do on every interrupt, and flooding the
      CPU with interrupts could theoretically cause bogus floods of entropy
      from a somewhat externally controllable source.
      
      This solves both problems by limiting the actual randomness addition
      to just once a second or after 64 interrupts, whicever comes first.
      During that time, the interrupt cycle data is buffered up in a per-cpu
      pool.  Also, we make sure the the nonblocking pool used by urandom is
      initialized before we start feeding the normal input pool.  This
      assures that /dev/urandom is returning unpredictable data as soon as
      possible.
      
      (Based on an original patch by Linus, but significantly modified by
      tytso.)
      Tested-by: default avatarEric Wustrow <ewust@umich.edu>
      Reported-by: default avatarEric Wustrow <ewust@umich.edu>
      Reported-by: default avatarNadia Heninger <nadiah@cs.ucsd.edu>
      Reported-by: default avatarZakir Durumeric <zakir@umich.edu>
      Reported-by: default avatarJ. Alex Halderman <jhalderm@umich.edu>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aa88dea2
    • Mathieu Desnoyers's avatar
      drivers/char/random.c: fix boot id uniqueness race · f5a1367c
      Mathieu Desnoyers authored
      commit 44e4360f upstream.
      
      /proc/sys/kernel/random/boot_id can be read concurrently by userspace
      processes.  If two (or more) user-space processes concurrently read
      boot_id when sysctl_bootid is not yet assigned, a race can occur making
      boot_id differ between the reads.  Because the whole point of the boot id
      is to be unique across a kernel execution, fix this by protecting this
      operation with a spinlock.
      
      Given that this operation is not frequently used, hitting the spinlock
      on each call should not be an issue.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5a1367c
    • H. Peter Anvin's avatar
      random: Adjust the number of loops when initializing · a5914eb0
      H. Peter Anvin authored
      commit 2dac8e54 upstream.
      
      When we are initializing using arch_get_random_long() we only need to
      loop enough times to touch all the bytes in the buffer; using
      poolwords for that does twice the number of operations necessary on a
      64-bit machine, since in the random number generator code "word" means
      32 bits.
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Link: http://lkml.kernel.org/r/1324589281-31931-1-git-send-email-tytso@mit.eduSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a5914eb0
    • Theodore Ts'o's avatar
      random: Use arch-specific RNG to initialize the entropy store · d191959f
      Theodore Ts'o authored
      commit 3e88bdff upstream.
      
      If there is an architecture-specific random number generator (such as
      RDRAND for Intel architectures), use it to initialize /dev/random's
      entropy stores.  Even in the worst case, if RDRAND is something like
      AES(NSA_KEY, counter++), it won't hurt, and it will definitely help
      against any other adversaries.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Link: http://lkml.kernel.org/r/1324589281-31931-1-git-send-email-tytso@mit.eduSigned-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d191959f
    • Linus Torvalds's avatar
      random: Use arch_get_random_int instead of cycle counter if avail · be0052b8
      Linus Torvalds authored
      commit cf833d0b upstream.
      
      We still don't use rdrand in /dev/random, which just seems stupid. We
      accept the *cycle*counter* as a random input, but we don't accept
      rdrand? That's just broken.
      
      Sure, people can do things in user space (write to /dev/random, use
      rdrand in addition to /dev/random themselves etc etc), but that
      *still* seems to be a particularly stupid reason for saying "we
      shouldn't bother to try to do better in /dev/random".
      
      And even if somebody really doesn't trust rdrand as a source of random
      bytes, it seems singularly stupid to trust the cycle counter *more*.
      
      So I'd suggest the attached patch. I'm not going to even bother
      arguing that we should add more bits to the entropy estimate, because
      that's not the point - I don't care if /dev/random fills up slowly or
      not, I think it's just stupid to not use the bits we can get from
      rdrand and mix them into the strong randomness pool.
      
      Link: http://lkml.kernel.org/r/CA%2B55aFwn59N1=m651QAyTy-1gO1noGbK18zwKDwvwqnravA84A@mail.gmail.comAcked-by: default avatar"David S. Miller" <davem@davemloft.net>
      Acked-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be0052b8
    • Luck, Tony's avatar
      fix typo/thinko in get_random_bytes() · 21a465d5
      Luck, Tony authored
      commit bd29e568 upstream.
      
      If there is an architecture-specific random number generator we use it
      to acquire randomness one "long" at a time.  We should put these random
      words into consecutive words in the result buffer - not just overwrite
      the first word again and again.
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21a465d5
    • H. Peter Anvin's avatar
      random: Add support for architectural random hooks · 6133313b
      H. Peter Anvin authored
      commit 63d77173 upstream.
      
      Add support for architecture-specific hooks into the kernel-directed
      random number generator interfaces.  This patchset does not use the
      architecture random number generator interfaces for the
      userspace-directed interfaces (/dev/random and /dev/urandom), thus
      eliminating the need to distinguish between them based on a pool
      pointer.
      
      Changes in version 3:
      - Moved the hooks from extract_entropy() to get_random_bytes().
      - Changes the hooks to inlines.
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6133313b
    • Alan Cox's avatar
      x86, nops: Missing break resulting in incorrect selection on Intel · 7b1cad62
      Alan Cox authored
      commit d6250a3f upstream.
      
      The Intel case falls through into the generic case which then changes
      the values.  For cases like the P6 it doesn't do the right thing so
      this seems to be a screwup.
      Signed-off-by: default avatarAlan Cox <alan@linux.intel.com>
      Link: http://lkml.kernel.org/n/tip-lww2uirad4skzjlmrm0vru8o@git.kernel.orgSigned-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b1cad62
    • Johannes Berg's avatar
      mac80211: cancel mesh path timer · 61e0a9e7
      Johannes Berg authored
      commit dd4c9260 upstream.
      
      The mesh path timer needs to be canceled when
      leaving the mesh as otherwise it could fire
      after the interface has been removed already.
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61e0a9e7
    • Xiao Guangrong's avatar
      mm: mmu_notifier: fix freed page still mapped in secondary MMU · 8bda26e3
      Xiao Guangrong authored
      commit 3ad3d901 upstream.
      
      mmu_notifier_release() is called when the process is exiting.  It will
      delete all the mmu notifiers.  But at this time the page belonging to the
      process is still present in page tables and is present on the LRU list, so
      this race will happen:
      
            CPU 0                 CPU 1
      mmu_notifier_release:    try_to_unmap:
         hlist_del_init_rcu(&mn->hlist);
                                  ptep_clear_flush_notify:
                                        mmu nofifler not found
                                  free page  !!!!!!
                                  /*
                                   * At the point, the page has been
                                   * freed, but it is still mapped in
                                   * the secondary MMU.
                                   */
      
        mn->ops->release(mn, mm);
      
      Then the box is not stable and sometimes we can get this bug:
      
      [  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
      [  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
      [  738.075936] page flags: 0x20000000000014(referenced|dirty)
      
      The same issue is present in mmu_notifier_unregister().
      
      We can call ->release before deleting the notifier to ensure the page has
      been unmapped from the secondary MMU before it is freed.
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8bda26e3
    • Will Deacon's avatar
      ARM: 7479/1: mm: avoid NULL dereference when flushing gate_vma with VIVT caches · b9d316de
      Will Deacon authored
      commit b74253f7 upstream.
      
      The vivt_flush_cache_{range,page} functions check that the mm_struct
      of the VMA being flushed has been active on the current CPU before
      performing the cache maintenance.
      
      The gate_vma has a NULL mm_struct pointer and, as such, will cause a
      kernel fault if we try to flush it with the above operations. This
      happens during ELF core dumps, which include the gate_vma as it may be
      useful for debugging purposes.
      
      This patch adds checks to the VIVT cache flushing functions so that VMAs
      with a NULL mm_struct are flushed unconditionally (the vectors page may
      be dirty if we use it to store the current TLS pointer).
      Reported-by: default avatarGilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
      Tested-by: default avatarUros Bizjak <ubizjak@gmail.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9d316de
    • Will Deacon's avatar
      ARM: 7478/1: errata: extend workaround for erratum #720789 · 0b41a531
      Will Deacon authored
      commit 5a783cbc upstream.
      
      Commit cdf357f1 ("ARM: 6299/1: errata: TLBIASIDIS and TLBIMVAIS
      operations can broadcast a faulty ASID") replaced by-ASID TLB flushing
      operations with all-ASID variants to workaround A9 erratum #720789.
      
      This patch extends the workaround to include the tlb_range operations,
      which were overlooked by the original patch.
      Tested-by: default avatarSteve Capper <steve.capper@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0b41a531
    • Joonsoo Kim's avatar
      mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page() · cad33da5
      Joonsoo Kim authored
      commit dc32f634 upstream.
      
      Commit a6bc32b8 ("mm: compaction: introduce sync-light migration for
      use by compaction") changed the declaration of migrate_pages() and
      migrate_huge_pages().
      
      But it missed changing the argument of migrate_huge_pages() in
      soft_offline_huge_page().  In this case, we should call
      migrate_huge_pages() with MIGRATE_SYNC.
      
      Additionally, there is a mismatch between type the of argument and the
      function declaration for migrate_pages().
      Signed-off-by: default avatarJoonsoo Kim <js1304@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cad33da5
    • Greg Pearson's avatar
      pcdp: use early_ioremap/early_iounmap to access pcdp table · 83c0c5e4
      Greg Pearson authored
      commit 6c4088ac upstream.
      
      efi_setup_pcdp_console() is called during boot to parse the HCDP/PCDP
      EFI system table and setup an early console for printk output.  The
      routine uses ioremap/iounmap to setup access to the HCDP/PCDP table
      information.
      
      The call to ioremap is happening early in the boot process which leads
      to a panic on x86_64 systems:
      
          panic+0x01ca
          do_exit+0x043c
          oops_end+0x00a7
          no_context+0x0119
          __bad_area_nosemaphore+0x0138
          bad_area_nosemaphore+0x000e
          do_page_fault+0x0321
          page_fault+0x0020
          reserve_memtype+0x02a1
          __ioremap_caller+0x0123
          ioremap_nocache+0x0012
          efi_setup_pcdp_console+0x002b
          setup_arch+0x03a9
          start_kernel+0x00d4
          x86_64_start_reservations+0x012c
          x86_64_start_kernel+0x00fe
      
      This replaces the calls to ioremap/iounmap in efi_setup_pcdp_console()
      with calls to early_ioremap/early_iounmap which can be called during
      early boot.
      
      This patch was tested on an x86_64 prototype system which uses the
      HCDP/PCDP table for early console setup.
      Signed-off-by: default avatarGreg Pearson <greg.pearson@hp.com>
      Acked-by: default avatarKhalid Aziz <khalid.aziz@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      83c0c5e4
    • Ryusuke Konishi's avatar
      nilfs2: fix deadlock issue between chcp and thaw ioctls · 85e937dc
      Ryusuke Konishi authored
      commit 572d8b39 upstream.
      
      An fs-thaw ioctl causes deadlock with a chcp or mkcp -s command:
      
       chcp            D ffff88013870f3d0     0  1325   1324 0x00000004
       ...
       Call Trace:
         nilfs_transaction_begin+0x11c/0x1a0 [nilfs2]
         wake_up_bit+0x20/0x20
         copy_from_user+0x18/0x30 [nilfs2]
         nilfs_ioctl_change_cpmode+0x7d/0xcf [nilfs2]
         nilfs_ioctl+0x252/0x61a [nilfs2]
         do_page_fault+0x311/0x34c
         get_unmapped_area+0x132/0x14e
         do_vfs_ioctl+0x44b/0x490
         __set_task_blocked+0x5a/0x61
         vm_mmap_pgoff+0x76/0x87
         __set_current_blocked+0x30/0x4a
         sys_ioctl+0x4b/0x6f
         system_call_fastpath+0x16/0x1b
       thaw            D ffff88013870d890     0  1352   1351 0x00000004
       ...
       Call Trace:
         rwsem_down_failed_common+0xdb/0x10f
         call_rwsem_down_write_failed+0x13/0x20
         down_write+0x25/0x27
         thaw_super+0x13/0x9e
         do_vfs_ioctl+0x1f5/0x490
         vm_mmap_pgoff+0x76/0x87
         sys_ioctl+0x4b/0x6f
         filp_close+0x64/0x6c
         system_call_fastpath+0x16/0x1b
      
      where the thaw ioctl deadlocked at thaw_super() when called while chcp was
      waiting at nilfs_transaction_begin() called from
      nilfs_ioctl_change_cpmode().  This deadlock is 100% reproducible.
      
      This is because nilfs_ioctl_change_cpmode() first locks sb->s_umount in
      read mode and then waits for unfreezing in nilfs_transaction_begin(),
      whereas thaw_super() locks sb->s_umount in write mode.  The locking of
      sb->s_umount here was intended to make snapshot mounts and the downgrade
      of snapshots to checkpoints exclusive.
      
      This fixes the deadlock issue by replacing the sb->s_umount usage in
      nilfs_ioctl_change_cpmode() with a dedicated mutex which protects snapshot
      mounts.
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85e937dc
    • Stanislav Kinsbursky's avatar
      SUNRPC: return negative value in case rpcbind client creation error · d90c97ba
      Stanislav Kinsbursky authored
      commit caea33da upstream.
      
      Without this patch kernel will panic on LockD start, because lockd_up() checks
      lockd_up_net() result for negative value.
      From my pow it's better to return negative value from rpcbind routines instead
      of replacing all such checks like in lockd_up().
      Signed-off-by: default avatarStanislav Kinsbursky <skinsbursky@parallels.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d90c97ba
    • Tony Luck's avatar
      Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts · 5bf75ed6
      Tony Luck authored
      commit a1193655 upstream.
      
      The following build error occured during a ia64 build with
      swap-over-NFS patches applied.
      
      net/core/sock.c:274:36: error: initializer element is not constant
      net/core/sock.c:274:36: error: (near initialization for 'memalloc_socks')
      net/core/sock.c:274:36: error: initializer element is not constant
      
      This is identical to a parisc build error. Fengguang Wu, Mel Gorman
      and James Bottomley did all the legwork to track the root cause of
      the problem. This fix and entire commit log is shamelessly copied
      from them with one extra detail to change a dubious runtime use of
      ATOMIC_INIT() to atomic_set() in drivers/char/mspec.c
      
      Dave Anglin says:
      > Here is the line in sock.i:
      >
      > struct static_key memalloc_socks = ((struct static_key) { .enabled =
      > ((atomic_t) { (0) }) });
      
      The above line contains two compound literals.  It also uses a designated
      initializer to initialize the field enabled.  A compound literal is not a
      constant expression.
      
      The location of the above statement isn't fully clear, but if a compound
      literal occurs outside the body of a function, the initializer list must
      consist of constant expressions.
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5bf75ed6
    • Kevin Winchester's avatar
      x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86' · 22268214
      Kevin Winchester authored
      commit 141168c3 and
      commit 3f806e50 upstream.
      
      Several fields in struct cpuinfo_x86 were not defined for the
      !SMP case, likely to save space.  However, those fields still
      have some meaning for UP, and keeping them allows some #ifdef
      removal from other files.  The additional size of the UP kernel
      from this change is not significant enough to worry about
      keeping up the distinction:
      
      	   text    data     bss     dec     hex filename
      	4737168	 506459	 972040	6215667	 5ed7f3	vmlinux.o.before
      	4737444	 506459	 972040	6215943	 5ed907	vmlinux.o.after
      
      for a difference of 276 bytes for an example UP config.
      
      If someone wants those 276 bytes back badly then it should
      be implemented in a cleaner way.
      Signed-off-by: default avatarKevin Winchester <kjwinchester@gmail.com>
      Cc: Steffen Persvold <sp@numascale.com>
      Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarBorislav Petkov <borislav.petkov@amd.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      22268214
  2. 09 Aug, 2012 5 commits