1. 10 Jun, 2014 15 commits
    • Filipe Manana's avatar
      Btrfs: implement inode_operations callback tmpfile · ef3b9af5
      Filipe Manana authored
      This implements the tmpfile callback of struct inode_operations, introduced
      in the linux kernel 3.11, and implemented already by some filesystems. This
      callback is invoked by the VFS when the flag O_TMPFILE is passed to the open
      system call.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      ef3b9af5
    • David Sterba's avatar
      btrfs: make FS_INFO ioctl available to anyone · e4ef90ff
      David Sterba authored
      This ioctl provides basic info about the filesystem that can be obtained
      in other ways (eg. sysfs), there's no reason to restrict it to
      CAP_SYSADMIN.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e4ef90ff
    • David Sterba's avatar
      btrfs: make DEV_INFO ioctl available to anyone · 7d6213c5
      David Sterba authored
      This ioctl provides basic info about the devices that can be obtained in
      other ways (eg. sysfs), there's no reason to restrict it to
      CAP_SYSADMIN.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7d6213c5
    • David Sterba's avatar
      btrfs: export more from FS_INFO to sysfs · df93589a
      David Sterba authored
      Similar to the FS_INFO updates, export the basic filesystem info through
      sysfs: node size, sector size and clone alignment.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      df93589a
    • David Sterba's avatar
      btrfs: retrieve more info from FS_INFO ioctl · 80a773fb
      David Sterba authored
      Provide the basic information about filesystem through the ioctl:
      * b-tree node size (same as leaf size)
      * sector size
      * expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments
      
      Backward compatibility: if the values are 0, kernel does not provide
      this information, the applications should ignore them.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      80a773fb
    • David Sterba's avatar
      btrfs: balance filter: add limit of processed chunks · 7d824b6f
      David Sterba authored
      This started as debugging helper, to watch the effects of converting
      between raid levels on multiple devices, but could be useful standalone.
      
      In my case the usage filter was not finegrained enough and led to
      converting too many chunks at once. Another example use is in connection
      with drange+devid or vrange filters that allow to work with a specific
      chunk or even with a chunk on a given device.
      
      The limit filter applies last, the value of 0 means no limiting.
      
      CC: Ilya Dryomov <idryomov@gmail.com>
      CC: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7d824b6f
    • Filipe Manana's avatar
      Btrfs: fix leaf corruption caused by ENOSPC while hole punching · fc19c5e7
      Filipe Manana authored
      While running a stress test with multiple threads writing to the same btrfs
      file system, I ended up with a situation where a leaf was corrupted in that
      it had 2 file extent item keys that had the same exact key. I was able to
      detect this quickly thanks to the following patch which triggers an assertion
      as soon as a leaf is marked dirty if there are duplicated keys or out of order
      keys:
      
          Btrfs: check if items are ordered when a leaf is marked dirty
          (https://patchwork.kernel.org/patch/3955431/)
      
      Basically while running the test, I got the following in dmesg:
      
          [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]()
          (...)
          [28877.415917] Call Trace:
          [28877.415922]  [<ffffffff816f1189>] dump_stack+0x4e/0x68
          [28877.415926]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
          [28877.415929]  [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20
          [28877.415944]  [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs]
          [28877.415949]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
          [28877.415962]  [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs]
          [28877.415972]  [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs]
          [28877.415984]  [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs]
          (...)
          [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24
          [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778
          (...)
          [29854.132637] 	item 23 key (3486 108 667648) itemoff 2694 itemsize 53
          [29854.132638] 		extent data disk bytenr 14574411776 nr 286720
          [29854.132639] 		extent data offset 0 nr 286720 ram 286720
          [29854.132640] 	item 24 key (3486 108 954368) itemoff 2641 itemsize 53
          [29854.132641] 		extent data disk bytenr 0 nr 0
          [29854.132643] 		extent data offset 0 nr 0 ram 0
          [29854.132644] 	item 25 key (3486 108 954368) itemoff 2588 itemsize 53
          [29854.132645] 		extent data disk bytenr 8699670528 nr 77824
          [29854.132646] 		extent data offset 0 nr 77824 ram 77824
          [29854.132647] 	item 26 key (3486 108 1146880) itemoff 2535 itemsize 53
          [29854.132648] 		extent data disk bytenr 8699670528 nr 77824
          [29854.132649] 		extent data offset 0 nr 77824 ram 77824
          (...)
          [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901!
          (...)
          [29854.132771] Call Trace:
          [29854.132779]  [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs]
          [29854.132791]  [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs]
          [29854.132794]  [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0
          [29854.132797]  [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10
          [29854.132800]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
          [29854.132810]  [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs]
          [29854.132820]  [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs]
          [29854.132830]  [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs]
          (...)
      
      So this is caused by getting an -ENOSPC error while punching a file hole, more
      specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop
      of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one
      or more file extent items due to lack of enough free space. When this happens,
      in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction
      block reservation object to root->fs_info->trans_block_rsv, end our transaction and
      start a new transaction basically - and, we keep increasing our current offset
      (cur_offset) as long as it's smaller than the end of the target range (lockend) -
      this makes use leave the loop with cur_offset == drop_end which in turn makes us
      call fill_holes() for inserting a file extent item that represents a 0 bytes range
      hole (and this insertion succeeds, as in the meanwhile more space became available).
      
      This 0 bytes file hole extent item is a problem because any subsequent caller of
      __btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a
      start file offset that is equal to the offset of the hole, will not remove this
      extent item due to the following conditional in the while loop of
      __btrfs_drop_extents:
      
          if (extent_end <= search_start) {
                  path->slots[0]++;
                  goto next_slot;
          }
      
      This later makes the call to setup_items_for_insert() (at the very end of
      __btrfs_drop_extents), insert a new file extent item with the same offset as
      the 0 bytes file hole extent item that follows it. Needless is to say that this
      causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook),
      where we perform leaf sanity checks or in subsequent operations that manipulate
      file extent items, as in the fallocate call as shown by the dmesg trace above.
      
      Without my other patch to perform the leaf sanity checks once a leaf is marked
      as dirty (if the integrity checker is enabled), it would have been much harder
      to debug this issue.
      
      This change might fix a few similar issues reported by users in the mailing
      list regarding assertion failures in btrfs_set_item_key_safe calls performed
      by __btrfs_drop_extents, such as the following report:
      
          http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938
      
      Asking fill_holes() to create a 0 bytes wide file hole item also produced the
      first warning in the trace above, as we passed a range to btrfs_drop_extent_cache
      that has an end smaller (by -1) than its start.
      
      On 3.14 kernels this issue manifests itself through leaf corruption, as we get
      duplicated file extent item keys in a leaf when calling setup_items_for_insert(),
      but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(),
      instead we have callers of __btrfs_drop_extents(), namely the functions
      inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling
      btrfs_insert_empty_item() to insert the new file extent item, which would fail with
      error -EEXIST, instead of inserting a duplicated key - which is still a serious
      issue as it would make all similar file extent item replace operations keep
      failing if they target the same file range.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      fc19c5e7
    • Liu Bo's avatar
      Btrfs: do not increment on bio_index one by one · d2cbf2a2
      Liu Bo authored
      'bio_index' is just a index, it's really not necessary to do increment
      one by one.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d2cbf2a2
    • Filipe Manana's avatar
      Btrfs: read inode size after acquiring the mutex when punching a hole · a1a50f60
      Filipe Manana authored
      In a previous change, commit 12870f1c,
      I accidentally moved the roundup of inode->i_size to outside of the
      critical section delimited by the inode mutex, which is not atomic and
      not correct since the size can be changed by other task before we acquire
      the mutex. Therefore fix it.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a1a50f60
    • Tobias Klauser's avatar
      btrfs: Remove unnecessary check for NULL · 7fb18a06
      Tobias Klauser authored
      iput() already checks for the inode being NULL, thus it's unnecessary to
      check before calling.
      Signed-off-by: default avatarTobias Klauser <tklauser@distanz.ch>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7fb18a06
    • Zach Brown's avatar
      btrfs: fix inline compressed read err corruption · 166ae5a4
      Zach Brown authored
      uncompress_inline() is dropping the error from btrfs_decompress() after
      testing it and zeroing the page that was supposed to hold decompressed
      data.  This can silently turn compressed inline data in to zeros if
      decompression fails due to corrupt compressed data or memory allocation
      failure.
      
      I verified this by manually forcing the error from btrfs_decompress()
      for a silly named copy of od:
      
      	if (!strcmp(current->comm, "failod"))
      		ret = -ENOMEM;
      
        # od -x /mnt/btrfs/dir/80 | head -1
        0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031
        # echo 3 > /proc/sys/vm/drop_caches
        # cp $(which od) /tmp/failod
        # /tmp/failod -x /mnt/btrfs/dir/80 | head -1
        0000000 0000 0000 0000 0000 0000 0000 0000 0000
      
      The fix is to pass the error to its caller.  Which still has a BUG_ON().
      So we fix that too.
      
      There seems to be no reason for the zeroing of the page on the error
      from btrfs_decompress() but not from the allocation error a few lines
      above.  So the page zeroing is removed.
      Signed-off-by: default avatarZach Brown <zab@redhat.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      166ae5a4
    • Zach Brown's avatar
      btrfs: return ptr error from compression workspace · 774bcb35
      Zach Brown authored
      The btrfs compression wrappers translated errors from workspace
      allocation to either -ENOMEM or -1.  The compression type workspace
      allocators are already returning a ERR_PTR(-ENOMEM).  Just return that
      and get rid of the magical -1.
      
      This helps a future patch return errors from the compression wrappers.
      Signed-off-by: default avatarZach Brown <zab@redhat.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      774bcb35
    • Zach Brown's avatar
      btrfs: return errno instead of -1 from compression · 60e1975a
      Zach Brown authored
      The compression layer seems to have been built to return -1 and have
      callers make up errors that make sense.  This isn't great because there
      are different errors that originate down in the compression layer.
      
      Let's return real negative errnos from the compression layer so that
      callers can pass on the error without having to guess what happened.
      ENOMEM for allocation failure, E2BIG when compression exceeds the
      uncompressed input, and EIO for everything else.
      
      This helps a future path return errors from btrfs_decompress().
      Signed-off-by: default avatarZach Brown <zab@redhat.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      60e1975a
    • Stefan Behrens's avatar
      btrfs: check_int: propagate out-of-memory error upwards · 98806b44
      Stefan Behrens authored
      This issue was not causing any harm but IMO (and in the opinion of the
      static code checker) it is better to propagate this error status upwards.
      Signed-off-by: default avatarStefan Behrens <sbehrens@giantdisaster.de>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      98806b44
    • Filipe Manana's avatar
      Btrfs: fix hang on error (such as ENOSPC) when writing extent pages · 61391d56
      Filipe Manana authored
      When running low on available disk space and having several processes
      doing buffered file IO, I got the following trace in dmesg:
      
      [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
      [ 4202.720401]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
      [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4202.720874] kworker/u8:1    D 0000000000000001     0  5450      2 0x00000000
      [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
      [ 4202.720908]  ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
      [ 4202.720913]  ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
      [ 4202.720918]  00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
      [ 4202.720922] Call Trace:
      [ 4202.720931]  [<ffffffff816a3cb9>] schedule+0x29/0x70
      [ 4202.720950]  [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
      [ 4202.720956]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
      [ 4202.720972]  [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
      [ 4202.720988]  [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
      [ 4202.720994]  [<ffffffff810680e5>] process_one_work+0x1f5/0x530
      (...)
      [ 4202.721027] 2 locks held by kworker/u8:1/5450:
      [ 4202.721028]  #0:  (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
      [ 4202.721037]  #1:  ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
      [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
      [ 4202.721258]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
      [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4202.721699] btrfs           D 0000000000000001     0  7891   7890 0x00000001
      [ 4202.721704]  ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
      [ 4202.721710]  ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
      [ 4202.721714]  ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
      [ 4202.721718] Call Trace:
      [ 4202.721723]  [<ffffffff816a3cb9>] schedule+0x29/0x70
      [ 4202.721727]  [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
      [ 4202.721732]  [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
      [ 4202.721736]  [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
      [ 4202.721740]  [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
      [ 4202.721744]  [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
      [ 4202.721749]  [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
      [ 4202.721765]  [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
      [ 4202.721781]  [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
      [ 4202.721786]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
      [ 4202.721799]  [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
      [ 4202.721813]  [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
      (...)
      
      It turns out that extent_io.c:__extent_writepage(), which ends up being called
      through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
      -ENOSPC when calling the fill_delalloc callback. In this situation, it returned
      without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
      ever being called for the respective page, which prevents the ordered extent's
      bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
      is never queued into the endio_write_workers queue. This makes the task that
      called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
      extent.
      
      This is fairly easy to reproduce using a small filesystem and fsstress on
      a quad core vm:
      
          mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
          mount /dev/sdd /mnt
      
          fsstress -p 6 -d /mnt -n 100000 -x \
              "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
      	    -f allocsp=0 \
      	    -f bulkstat=0 \
      	    -f bulkstat1=0 \
      	    -f chown=0 \
      	    -f creat=1 \
      	    -f dread=0 \
      	    -f dwrite=0 \
      	    -f fallocate=1 \
      	    -f fdatasync=0 \
      	    -f fiemap=0 \
      	    -f freesp=0 \
      	    -f fsync=0 \
      	    -f getattr=0 \
      	    -f getdents=0 \
      	    -f link=0 \
      	    -f mkdir=0 \
      	    -f mknod=0 \
      	    -f punch=1 \
      	    -f read=0 \
      	    -f readlink=0 \
      	    -f rename=0 \
      	    -f resvsp=0 \
      	    -f rmdir=0 \
      	    -f setxattr=0 \
      	    -f stat=0 \
      	    -f symlink=0 \
      	    -f sync=0 \
      	    -f truncate=1 \
      	    -f unlink=0 \
      	    -f unresvsp=0 \
      	    -f write=4
      
      So just ensure that if an error happens while writing the extent page
      we call the writepage_end_io_hook callback. Also make it return the
      error code and ensure the caller (extent_write_cache_pages) processes
      all pages in the page vector even if an error happens only for some
      of them, so that ordered extents end up released.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      61391d56
  2. 08 Jun, 2014 2 commits
  3. 07 Jun, 2014 4 commits
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · c593e897
      Linus Torvalds authored
      Pull btrfs fix from Chris Mason:
       "I had this in my 3.16 merge window queue, but it is small and obvious
        enough for 3.15.  I cherry-picked and retested against current rc8"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: send, fix corrupted path strings for long paths
      c593e897
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 052e5c7e
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "Here are the remaining fixes for v3.15.
      
        This series includes:
      
         - iser-target fix for ImmediateData exception reference count bug
           (Sagi + nab)
         - iscsi-target fix for MC/S login + potential iser-target MRDSL
           buffer overrun (Santosh + Roland)
         - iser-target fix for v3.15-rc multi network portal shutdown
           regression (nab)
         - target fix for allowing READ_CAPCITY during ALUA Standby access
           state (Chris + nab)
         - target fix for NULL pointer dereference of alua_access_state for
           un-configured devices (Chris + nab)"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: Fix alua_access_state attribute OOPs for un-configured devices
        target: Allow READ_CAPACITY opcode in ALUA Standby access state
        iser-target: Fix multi network portal shutdown regression
        iscsi-target: Fix wrong buffer / buffer overrun in iscsi_change_param_value()
        iser-target: Add missing target_put_sess_cmd for ImmedateData failure
      052e5c7e
    • Linus Torvalds's avatar
      Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 813895f8
      Linus Torvalds authored
      Pull x86 fixes from Peter Anvin:
       "A significantly larger than I'd like set of patches for just below the
        wire.  All of these, however, fix real problems.
      
        The one thing that is genuinely scary in here is the change of SMP
        initialization, but that *does* fix a confirmed hang when booting
        virtual machines.
      
        There is also a patch to actually do the right thing about not
        offlining a CPU when there are not enough interrupt vectors available
        in the system; the accounting was done incorrectly.  The worst case
        for that patch is that we fail to offline CPUs when we should (the new
        code is strictly more conservative than the old), so is not
        particularly risky.
      
        Most of the rest is minor stuff; the EFI patches are all about
        exporting correct information to boot loaders and kexec"
      
      * 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/boot: EFI_MIXED should not prohibit loading above 4G
        x86/smpboot: Initialize secondary CPU only if master CPU will wait for it
        x86/smpboot: Log error on secondary CPU wakeup failure at ERR level
        x86: Fix list/memory corruption on CPU hotplug
        x86: irq: Get correct available vectors for cpu disable
        x86/efi: Do not export efi runtime map in case old map
        x86/efi: earlyprintk=efi,keep fix
      813895f8
    • Matt Fleming's avatar
      x86/boot: EFI_MIXED should not prohibit loading above 4G · 745c5167
      Matt Fleming authored
      commit 7d453eee ("x86/efi: Wire up CONFIG_EFI_MIXED") introduced a
      regression for the functionality to load kernels above 4G. The relevant
      (incorrect) reasoning behind this change can be seen in the commit
      message,
      
        "The xloadflags field in the bzImage header is also updated to reflect
        that the kernel supports both entry points by setting both of
        XLF_EFI_HANDOVER_32 and XLF_EFI_HANDOVER_64 when CONFIG_EFI_MIXED=y.
        XLF_CAN_BE_LOADED_ABOVE_4G is disabled so that the kernel text is
        guaranteed to be addressable with 32-bits."
      
      This is obviously bogus since 32-bit EFI loaders will never place the
      kernel above the 4G mark. So this restriction is entirely unnecessary.
      
      But things are worse than that - since we want to encourage people to
      always compile with CONFIG_EFI_MIXED=y so that their kernels work out of
      the box for both 32-bit and 64-bit firmware, commit 7d453eee
      effectively disables XLF_CAN_BE_LOADED_ABOVE_4G completely.
      
      Remove the overzealous and superfluous restriction and restore the
      XLF_CAN_BE_LOADED_ABOVE_4G functionality.
      
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarMatt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/1402140380-15377-1-git-send-email-matt@console-pimps.orgSigned-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      745c5167
  4. 06 Jun, 2014 6 commits
  5. 05 Jun, 2014 13 commits
    • H. Peter Anvin's avatar
      Merge tag 'efi-urgent' into x86/urgent · 17787542
      H. Peter Anvin authored
       * Fix earlyprintk=efi,keep support by switching to an ioremap() mapping
         of the framebuffer when early_ioremap() is no longer available and
         dropping __init from functions that may be invoked after
         free_initmem() - Dave Young
      
       * We shouldn't be exporting the EFI runtime map in sysfs if not using
         the new 1:1 EFI mapping code since in that case the mappings are not
         static across a kexec reboot - Dave Young
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      17787542
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 951e2730
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "Two last minute tooling fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf probe: Fix perf probe to find correct variable DIE
        perf probe: Fix a segfault if asked for variable it doesn't find
      951e2730
    • Linus Torvalds's avatar
      Merge branch 'futex-fixes' (futex fixes from Thomas Gleixner) · 1c5aefb5
      Linus Torvalds authored
      Merge futex fixes from Thomas Gleixner:
       "So with more awake and less futex wreckaged brain, I went through my
        list of points again and came up with the following 4 patches.
      
        1) Prevent pi requeueing on the same futex
      
           I kept Kees check for uaddr1 == uaddr2 as a early check for private
           futexes and added a key comparison to both futex_requeue and
           futex_wait_requeue_pi.
      
           Sebastian, sorry for the confusion yesterday night.  I really
           misunderstood your question.
      
           You are right the check is pointless for shared futexes where the
           same physical address is mapped to two different virtual addresses.
      
        2) Sanity check atomic acquisiton in futex_lock_pi_atomic
      
           That's basically what Darren suggested.
      
           I just simplified it to use futex_top_waiter() to find kernel
           internal state.  If state is found return -EINVAL and do not bother
           to fix up the user space variable.  It's corrupted already.
      
        3) Ensure state consistency in futex_unlock_pi
      
           The code is silly versus the owner died bit.  There is no point to
           preserve it on unlock when the user space thread owns the futex.
      
           What's worse is that it does not update the user space value when
           the owner died bit is set.  So the kernel itself creates observable
           inconsistency.
      
           Another "optimization" is to retry an atomic unlock.  That's
           pointless as in a sane environment user space would not call into
           that code if it could have unlocked it atomically.  So we always
           check whether there is kernel state around and only if there is
           none, we do the unlock by setting the user space value to 0.
      
        4) Sanitize lookup_pi_state
      
           lookup_pi_state is ambigous about TID == 0 in the user space value.
      
           This can be a valid state even if there is kernel state on this
           uaddr, but we miss a few corner case checks.
      
           I tried to come up with a smaller solution hacking the checks into
           the current cruft, but it turned out to be ugly as hell and I got
           more confused than I was before.  So I rewrote the sanity checks
           along the state documentation with awful lots of commentry"
      
      * emailed patches from Thomas Gleixner <tglx@linutronix.de>:
        futex: Make lookup_pi_state more robust
        futex: Always cleanup owner tid in unlock_pi
        futex: Validate atomic acquisition in futex_lock_pi_atomic()
        futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
      1c5aefb5
    • Thomas Gleixner's avatar
      futex: Make lookup_pi_state more robust · 54a21788
      Thomas Gleixner authored
      The current implementation of lookup_pi_state has ambigous handling of
      the TID value 0 in the user space futex.  We can get into the kernel
      even if the TID value is 0, because either there is a stale waiters bit
      or the owner died bit is set or we are called from the requeue_pi path
      or from user space just for fun.
      
      The current code avoids an explicit sanity check for pid = 0 in case
      that kernel internal state (waiters) are found for the user space
      address.  This can lead to state leakage and worse under some
      circumstances.
      
      Handle the cases explicit:
      
             Waiter | pi_state | pi->owner | uTID      | uODIED | ?
      
        [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
        [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid
      
        [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid
      
        [4]  Found  | Found    | NULL      | 0         | 1      | Valid
        [5]  Found  | Found    | NULL      | >0        | 1      | Invalid
      
        [6]  Found  | Found    | task      | 0         | 1      | Valid
      
        [7]  Found  | Found    | NULL      | Any       | 0      | Invalid
      
        [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
        [9]  Found  | Found    | task      | 0         | 0      | Invalid
        [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid
      
       [1] Indicates that the kernel can acquire the futex atomically. We
           came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
      
       [2] Valid, if TID does not belong to a kernel thread. If no matching
           thread is found then it indicates that the owner TID has died.
      
       [3] Invalid. The waiter is queued on a non PI futex
      
       [4] Valid state after exit_robust_list(), which sets the user space
           value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
      
       [5] The user space value got manipulated between exit_robust_list()
           and exit_pi_state_list()
      
       [6] Valid state after exit_pi_state_list() which sets the new owner in
           the pi_state but cannot access the user space value.
      
       [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
      
       [8] Owner and user space value match
      
       [9] There is no transient state which sets the user space TID to 0
           except exit_robust_list(), but this is indicated by the
           FUTEX_OWNER_DIED bit. See [4]
      
      [10] There is no transient state which leaves owner and user space
           TID out of sync.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54a21788
    • Thomas Gleixner's avatar
      futex: Always cleanup owner tid in unlock_pi · 13fbca4c
      Thomas Gleixner authored
      If the owner died bit is set at futex_unlock_pi, we currently do not
      cleanup the user space futex.  So the owner TID of the current owner
      (the unlocker) persists.  That's observable inconsistant state,
      especially when the ownership of the pi state got transferred.
      
      Clean it up unconditionally.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      13fbca4c
    • Thomas Gleixner's avatar
      futex: Validate atomic acquisition in futex_lock_pi_atomic() · b3eaa9fc
      Thomas Gleixner authored
      We need to protect the atomic acquisition in the kernel against rogue
      user space which sets the user space futex to 0, so the kernel side
      acquisition succeeds while there is existing state in the kernel
      associated to the real owner.
      
      Verify whether the futex has waiters associated with kernel state.  If
      it has, return -EINVAL.  The state is corrupted already, so no point in
      cleaning it up.  Subsequent calls will fail as well.  Not our problem.
      
      [ tglx: Use futex_top_waiter() and explain why we do not need to try
        	restoring the already corrupted user space state. ]
      Signed-off-by: default avatarDarren Hart <dvhart@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3eaa9fc
    • Thomas Gleixner's avatar
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in... · e9c243a5
      Thomas Gleixner authored
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
      
      If uaddr == uaddr2, then we have broken the rule of only requeueing from
      a non-pi futex to a pi futex with this call.  If we attempt this, then
      dangling pointers may be left for rt_waiter resulting in an exploitable
      condition.
      
      This change brings futex_requeue() in line with futex_wait_requeue_pi()
      which performs the same check as per commit 6f7b0a2a ("futex: Forbid
      uaddr == uaddr2 in futex_wait_requeue_pi()")
      
      [ tglx: Compare the resulting keys as well, as uaddrs might be
        	different depending on the mapping ]
      
      Fixes CVE-2014-3153.
      
      Reported-by: Pinkie Pie
      Signed-off-by: default avatarWill Drewry <wad@chromium.org>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarDarren Hart <dvhart@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e9c243a5
    • Igor Mammedov's avatar
      x86/smpboot: Initialize secondary CPU only if master CPU will wait for it · 3e1a878b
      Igor Mammedov authored
      Hang is observed on virtual machines during CPU hotplug,
      especially in big guests with many CPUs. (It reproducible
      more often if host is over-committed).
      
      It happens because master CPU gives up waiting on
      secondary CPU and allows it to run wild. As result
      AP causes locking or crashing system. For example
      as described here:
      
         https://lkml.org/lkml/2014/3/6/257
      
      If master CPU have sent STARTUP IPI successfully,
      and AP signalled to master CPU that it's ready
      to start initialization, make master CPU wait
      indefinitely till AP is onlined.
      To ensure that AP won't ever run wild, make it
      wait at early startup till master CPU confirms its
      intention to wait for AP. If AP doesn't respond in 10
      seconds, the master CPU will timeout and cancel
      AP onlining.
      Signed-off-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-4-git-send-email-imammedo@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      3e1a878b
    • Igor Mammedov's avatar
      x86/smpboot: Log error on secondary CPU wakeup failure at ERR level · feef1e8e
      Igor Mammedov authored
      If system is running without debug level logging,
      it will not log error if do_boot_cpu() failed to
      wakeup AP. It may lead to silent AP bringup
      failures at boot time.
      Change message level to KERN_ERR to make error
      visible to user as it's done on other architectures.
      Signed-off-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-3-git-send-email-imammedo@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      feef1e8e
    • Igor Mammedov's avatar
      x86: Fix list/memory corruption on CPU hotplug · 89f898c1
      Igor Mammedov authored
      currently if AP wake up is failed, master CPU marks AP as not
      present in do_boot_cpu() by calling set_cpu_present(cpu, false).
      That leads to following list corruption on the next physical CPU
      hotplug:
      
      [  418.107336] WARNING: CPU: 1 PID: 45 at lib/list_debug.c:33 __list_add+0xbe/0xd0()
      [  418.115268] list_add corruption. prev->next should be next (ffff88003dc57600), but was ffff88003e20c3a0. (prev=ffff88003e20c3a0).
      [  418.123693] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT cfg80211 xt_conntrack rfkill ee
      [  418.138979] CPU: 1 PID: 45 Comm: kworker/u10:1 Not tainted 3.14.0-rc6+ #387
      [  418.149989] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [  418.165750] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
      [  418.166433]  0000000000000021 ffff880038ca7988 ffffffff8159b22d 0000000000000021
      [  418.176460]  ffff880038ca79d8 ffff880038ca79c8 ffffffff8106942c ffff880038ca79e8
      [  418.177453]  ffff88003e20c3a0 ffff88003dc57600 ffff88003e20c3a0 00000000ffffffea
      [  418.178445] Call Trace:
      [  418.185811]  [<ffffffff8159b22d>] dump_stack+0x49/0x5c
      [  418.186440]  [<ffffffff8106942c>] warn_slowpath_common+0x8c/0xc0
      [  418.187192]  [<ffffffff81069516>] warn_slowpath_fmt+0x46/0x50
      [  418.191231]  [<ffffffff8136ef51>] ? acpi_ns_get_node+0xb7/0xc7
      [  418.193889]  [<ffffffff812f796e>] __list_add+0xbe/0xd0
      [  418.196649]  [<ffffffff812e2aa9>] kobject_add_internal+0x79/0x200
      [  418.208610]  [<ffffffff812e2e18>] kobject_add_varg+0x38/0x60
      [  418.213831]  [<ffffffff812e2ef4>] kobject_add+0x44/0x70
      [  418.229961]  [<ffffffff813e2c60>] device_add+0xd0/0x550
      [  418.234991]  [<ffffffff813f0e95>] ? pm_runtime_init+0xe5/0xf0
      [  418.250226]  [<ffffffff813e32be>] device_register+0x1e/0x30
      [  418.255296]  [<ffffffff813e82a3>] register_cpu+0xe3/0x130
      [  418.266539]  [<ffffffff81592be5>] arch_register_cpu+0x65/0x150
      [  418.285845]  [<ffffffff81355c0d>] acpi_processor_hotadd_init+0x5a/0x9b
      ...
      Which is caused by the fact that generic_processor_info() allocates
      logical CPU id by calling:
      
       cpu = cpumask_next_zero(-1, cpu_present_mask);
      
      which returns id of previously failed to wake up CPU, since its
      bit is cleared by do_boot_cpu() and as result register_cpu()
      tries to register another CPU with the same id as already
      present but failed to be onlined CPU.
      
      Taking in account that AP will not do anything if master CPU
      failed to wake it up, there is no reason to mark that AP as not
      present and break next cpu hotplug attempts. As a side effect of
      not marking AP as not present, user would be allowed to online
      it again later.
      
      Also fix memory corruption in acpi_unmap_lsapic()
      
      if during CPU hotplug master CPU failed to wake up AP
      it set percpu x86_cpu_to_apicid to BAD_APICID=0xFFFF for AP.
      
      However following attempt to unplug that CPU will lead to
      out of bound write access to __apicid_to_node[] which is
      32768 items long on x86_64 kernel.
      
      So with above fix of cpu_present_mask make sure that a present
      CPU has a valid APIC ID by not setting x86_cpu_to_apicid
      to BAD_APICID in do_boot_cpu() on failure and allow
      acpi_processor_remove()->acpi_unmap_lsapic() cleanly remove CPU.
      Signed-off-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-2-git-send-email-imammedo@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      89f898c1
    • Roman Gushchin's avatar
      sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock · 09dc4ab0
      Roman Gushchin authored
      tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
      force the period timer restart. It's not safe, because
      can lead to deadlock, described in commit 927b54fc:
      "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
      waiting for the hrtimer to finish. However, if sched_cfs_period_timer
      runs for another loop iteration, the hrtimer can attempt to take
      rq->lock, resulting in deadlock."
      
      Three CPUs must be involved:
      
        CPU0               CPU1                         CPU2
        take rq->lock      period timer fired
        ...                take cfs_b lock
        ...                ...                          tg_set_cfs_bandwidth()
        throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
        ...                distribute_cfs_runtime()     timer_active = 0
        take cfs_b->lock   wait for rq->lock            ...
        __start_cfs_bandwidth()
        {wait for timer callback
         break if timer_active == 1}
      
      So, CPU0 and CPU1 are deadlocked.
      
      Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
      wait for period timer callbacks (ignoring cfs_b->timer_active) and
      restart the timer explicitly.
      Signed-off-by: default avatarRoman Gushchin <klamm@yandex-team.ru>
      Reviewed-by: default avatarBen Segall <bsegall@google.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%klamm@yandex-team.ru
      Cc: pjt@google.com
      Cc: chris.j.arges@canonical.com
      Cc: gregkh@linuxfoundation.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      09dc4ab0
    • Kirill Tkhai's avatar
      sched/dl: Fix race in dl_task_timer() · 0f397f2c
      Kirill Tkhai authored
      Throttled task is still on rq, and it may be moved to other cpu
      if user is playing with sched_setaffinity(). Therefore, unlocked
      task_rq() access makes the race.
      
      Juri Lelli reports he got this race when dl_bandwidth_enabled()
      was not set.
      
      Other thing, pointed by Peter Zijlstra:
      
         "Now I suppose the problem can still actually happen when
          you change the root domain and trigger a effective affinity
          change that way".
      
      To fix that we do the same as made in __task_rq_lock(). We do not
      use __task_rq_lock() itself, because it has a useful lockdep check,
      which is not correct in case of dl_task_timer(). We do not need
      pi_lock locked here. This case is an exception (PeterZ):
      
         "The only reason we don't strictly need ->pi_lock now is because
          we're guaranteed to have p->state == TASK_RUNNING here and are
          thus free of ttwu races".
      Signed-off-by: default avatarKirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # v3.14+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/3056991400578422@web14g.yandex.ruSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0f397f2c
    • Richard Weinberger's avatar
      sched: Fix sched_policy < 0 comparison · b14ed2c2
      Richard Weinberger authored
      attr.sched_policy is u32, therefore a comparison against < 0 is never true.
      Fix this by casting sched_policy to int.
      
      This issue was reported by coverity CID 1219934.
      
      Fixes: dbdb2275 ("sched: Disallow sched_attr::sched_policy < 0")
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1401741514-7045-1-git-send-email-richard@nod.atSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      b14ed2c2