  1. 21 Feb, 2015 2 commits
    • Jens Axboe's avatar
      Merge branch 'for-3.20' of git://git.infradead.org/users/kbusch/linux-nvme into for-linus · decf6d79
      Jens Axboe authored
      Merge 3.20 NVMe changes from Keith.
      decf6d79
    • Thadeu Lima de Souza Cascardo's avatar
      blk-throttle: check stats_cpu before reading it from sysfs · 045c47ca
      Thadeu Lima de Souza Cascardo authored
      When reading blkio.throttle.io_serviced in a recently created blkio
      cgroup, it's possible to race against the creation of a throttle policy,
      which delays the allocation of stats_cpu.
      
      Like other functions in the throttle code, just checking for a NULL
      stats_cpu prevents the following oops caused by that race.
      
      [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
      [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
      [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
      [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
      [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
      [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
      [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
      [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
      [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
      [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
      GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
      GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
      GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
      GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
      GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
      GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
      [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
      [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
      [ 1137.734943] Call Trace:
      [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
      [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
      [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
      [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
      [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
      [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
      [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
      [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
      [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
      [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
      [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
      [ 1137.735383] Instruction dump:
      [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
      [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
      
      Here is code that easily reproduces this, although the problem was first
      found by running Docker.
      
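      /* Headers and constants below are not part of the original message;
       * the constant values are assumptions chosen for illustration. */
      #define _GNU_SOURCE	/* for O_DIRECT */
      #include <fcntl.h>
      #include <malloc.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/stat.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #define CGPATH "/sys/fs/cgroup/blkio"	/* assumed cgroup v1 mount point */
      #define BUFFER_ALIGN 512		/* assumed alignment for O_DIRECT */
      #define BUFFER_SIZE 4096		/* assumed buffer size */
      #define NR_TESTS 100			/* assumed iteration count */

      void run(pid_t pid)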
      {
      	int n;
      	int status;
      	int fd;
      	char *buffer;
      	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
      	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
      	fd = open(CGPATH "/test/tasks", O_WRONLY);
      	write(fd, buffer, n);
      	close(fd);
      	if (fork() > 0) {
      		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
      		read(fd, buffer, 512);
      		close(fd);
      		wait(&status);
      	} else {
      		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
      		n = read(fd, buffer, BUFFER_SIZE);
      		close(fd);
      	}
      	free(buffer);
      	exit(0);
      }
      
      void test(void)
      {
      	int status;
      	mkdir(CGPATH "/test", 0666);
      	if (fork() > 0)
      		wait(&status);
      	else
      		run(getpid());
      	rmdir(CGPATH "/test");
      }
      
      int main(int argc, char **argv)
      {
      	int i;
      	for (i = 0; i < NR_TESTS; i++)
      		test();
      	return 0;
      }
      Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      045c47ca
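The NULL check described in this commit can be illustrated outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not the actual patch: `struct tg_sketch`, `tg_alloc_stats()` and `tg_read_serviced()` are hypothetical names, and a plain heap array stands in for the per-CPU allocation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 4	/* assumption: fixed CPU count for the sketch */

/* Hypothetical stand-in for a throttle group whose stats are allocated lazily. */
struct tg_sketch {
	uint64_t *stats_cpu;	/* NULL until the delayed allocation has run */
};

/* Delayed allocation: may run some time after the group was created. */
static int tg_alloc_stats(struct tg_sketch *tg)
{
	tg->stats_cpu = calloc(NR_CPUS_SKETCH, sizeof(*tg->stats_cpu));
	return tg->stats_cpu ? 0 : -1;
}

/* Reader: report zero instead of dereferencing stats that do not exist yet. */
static uint64_t tg_read_serviced(const struct tg_sketch *tg)
{
	uint64_t sum = 0;
	size_t cpu;

	if (tg->stats_cpu == NULL)
		return 0;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += tg->stats_cpu[cpu];
	return sum;
}
```

A reader that arrives before the allocation sees an empty counter rather than faulting, which is the behaviour the commit message describes for the sysfs read path.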
  2. 20 Feb, 2015 1 commit
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal · 3d883483
      Linus Torvalds authored
      Pull more thermal management updates from Zhang Rui:
       "Specifics:
      
         - Exynos thermal driver refactoring.  Several cleanups, code
           optimization, unused symbols removal, and unused feature removal in
           Exynos thermal driver.  Thanks Lukasz for this effort.
      
         - Exynos thermal driver support to OF thermal.  After the code
           refactoring, the driver earned the support to OF thermal.  Chip
           thermal data were moved from driver code to DTS, reducing the code
           footprint.  Thanks Lukasz for this.
      
         - After receiving the OF thermal support, the exynos thermal driver
           now must allow modular build.  Thanks Arnd for detecting, reporting
           and fixing this.
      
         - Exynos thermal driver support to Exynos 7 SoC.  Thanks Abhilash for
           this.
      
         - Accurate temperature reporting on Rockchip thermal driver, thanks
           to Caesar.
      
         - Fix on how OF thermal enables its zones, thanks Lukasz for fixing.
      
         - Fixes in OF thermal examples under Documentation/.  Thanks Srinivas
           for fixing"
      
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal:
        thermal: exynos: Add TMU support for Exynos7 SoC
        dts: Documentation: Add documentation for Exynos7 SoC thermal bindings
        cpufreq: exynos: allow modular build
        thermal: Fix examples in DT documentation
        thermal: exynos: Correct sanity check at exynos_report_trigger() function
        thermal: Kconfig: Remove config for not used EXYNOS_THERMAL_CORE
        thermal: exynos: Remove exynos_tmu_data.c file
        thermal: rockchip: make temperature reporting much more accurate
        thermal: exynos: Remove exynos_thermal_common.[c|h] files
        thermal: samsung: core: Exynos TMU rework to use device tree for configuration
        dts: Documentation: Update exynos-thermal.txt example for Exynos5440
        dts: Documentation: Extending documentation entry for exynos-thermal
        cpufreq: exynos: Use device tree to determine if cpufreq cooling should be registered
        thermal: exynos: Modify exynos thermal code to use device tree for cpu cooling configuration
        thermal: exynos: Provide thermal_exynos.h file to be included in device tree files
        thermal: exynos: cosmetic: Correct comment format
        thermal: of: Enable thermal_zoneX when sensor is correctly added
      3d883483
  3. 19 Feb, 2015 37 commits
    • Keith Busch's avatar
      NVMe: Fix potential corruption on sync commands · 0c0f9b95
      Keith Busch authored
      This makes all sync commands uninterruptible and schedules without timeout
      so the controller either has to post a completion or the timeout recovery
      fails the command. This fixes potential memory or data corruption from
      a command timing out too early or woken by a signal. Previously any DMA
      buffers mapped for that command would have been released even though we
      don't know what the controller is planning to do with those addresses.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      0c0f9b95
    • Keith Busch's avatar
      NVMe: Remove unused variables · 48328518
      Keith Busch authored
      We don't track queues in a llist, subscribe to hot-cpu notifications,
      or internally retry commands. Delete the unused artifacts.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      48328518
    • Keith Busch's avatar
      NVMe: Fix scsi mode select llbaa setting · 9ac16938
      Keith Busch authored
      The llbaa flag should be taken with a bitwise AND (&), not a logical AND (&&).
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      9ac16938
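The difference between the two AND operators named in this commit can be shown with a small sketch. The mask value and function names here are assumptions for illustration and are not taken from the driver.

```c
#include <stdint.h>

#define LLBAA_MASK_SKETCH 0x01	/* assumption: the flag lives in bit 0 of the byte */

/* Logical form: true whenever the byte is non-zero at all. */
static int llbaa_logical(uint8_t byte)
{
	return byte && LLBAA_MASK_SKETCH;
}

/* Bitwise form: isolates the one bit of interest. */
static int llbaa_bitwise(uint8_t byte)
{
	return (byte & LLBAA_MASK_SKETCH) != 0;
}
```

For a byte of 0x10 the logical form reports the flag as set while the bitwise form reports it clear, so unrelated bits in the same byte would leak into the setting.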
    • Keith Busch's avatar
      NVMe: Fix potential corruption during shutdown · 07836e65
      Keith Busch authored
      The driver has to end unreturned commands at some point even if the
      controller has not provided a completion. The driver tried to be safe by
      deleting IO queues prior to ending all unreturned commands. That should
      cause the controller to internally abort inflight commands, but IO queue
      deletion request does not have to be successful, so all bets are off. We
      still have to make progress, so to be extra safe, this patch doesn't
      clear a queue to release the dma mapping for a command until after the
      pci device has been disabled.
      
      This patch removes the special handling during device initialization
      so controller recovery can be done all the time. This is possible since
      initialization is not inlined with pci probe anymore.
      Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      07836e65
    • Keith Busch's avatar
      NVMe: Asynchronous controller probe · 2e1d8448
      Keith Busch authored
      This performs the longest parts of nvme device probe in scheduled work.
      This speeds up probe significantly when multiple devices are in use.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      2e1d8448
    • Keith Busch's avatar
      NVMe: Register management handle under nvme class · b3fffdef
      Keith Busch authored
      This creates a new class type for nvme devices to register their
      management character devices with. This is so we do not rely on miscdev
      to provide enough minors for as many nvme devices as some people plan to
      use. The previous limit was approximately 60 NVMe controllers, depending
      on the platform and kernel. Now the limit is 1M, which ought to be enough
      for anybody.
      
      Since we have a new device class, it makes sense to attach the block
      devices under this as well, so part of this patch moves the management
      handle initialization prior to the namespaces discovery.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      b3fffdef
    • Keith Busch's avatar
      NVMe: Update SCSI Inquiry VPD 83h translation · 4f1982b4
      Keith Busch authored
      The original translation created collisions on Inquiry VPD 83 for many
      existing devices. Newer specifications provide other ways to translate,
      based on the device's version, that can be used to create unique identifiers.
      
      Version 1.1 provides an EUI64 field that uniquely identifies each
      namespace, and 1.2 added the longer NGUID field for the same reason.
      Both follow the IEEE EUI format and readily translate to the SCSI device
      identification EUI designator type 2h. For devices implementing either,
      the translation will use this type, preferring the 8-byte EUI64 if it is
      implemented and the 16-byte NGUID if not. If neither is provided,
      the 1.0 translation is used, and is updated to use the SCSI String format
      to guarantee a unique identifier.
      
      Knowing when to use the new fields depends on the nvme controller's
      revision. The NVME_VS macro was not decoding this correctly, so that is
      fixed in this patch and moved to a more appropriate place.
      
      Since the Identify Namespace structure required an update for the NGUID
      field, this patch adds the remaining new 1.2 fields to the structure.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      4f1982b4
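The selection order described in this commit (EUI64 first, then NGUID, then the 1.0 translation) can be sketched as follows. This is an illustration under assumptions, not the driver code: `VS_SKETCH`, `field_present()` and `pick_identifier()` are hypothetical names, the version is packed with the major number in bits 31:16 and the minor number in bits 15:8, and a field is treated as implemented when any of its bytes is non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack major/minor as the version register lays them out (assumed layout:
 * major in bits 31:16, minor in bits 15:8). */
#define VS_SKETCH(major, minor) \
	(((uint32_t)(major) << 16) | ((uint32_t)(minor) << 8))

enum id_kind { ID_EUI64, ID_NGUID, ID_V1_0_STRING };

/* Assumption: a field counts as implemented if any byte is non-zero. */
static int field_present(const uint8_t *field, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (field[i])
			return 1;
	return 0;
}

/* Choose the identifier type from the controller version and the fields. */
static enum id_kind pick_identifier(uint32_t vs, const uint8_t *eui64,
				    const uint8_t *nguid)
{
	if (vs >= VS_SKETCH(1, 1) && field_present(eui64, 8))
		return ID_EUI64;
	if (vs >= VS_SKETCH(1, 2) && field_present(nguid, 16))
		return ID_NGUID;
	return ID_V1_0_STRING;
}
```

Comparing packed version values only works when the version is decoded consistently, which is why the commit fixes the version macro in the same change.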
    • Keith Busch's avatar
      NVMe: Metadata format support · e1e5e564
      Keith Busch authored
      Adds support for NVMe metadata formats and exposes block devices for
      all namespaces regardless of their format. Namespace formats that are
      unusable will have disk capacity set to 0, but a handle to the block
      device is created to simplify device management. A namespace is not
      usable when the format requires the host to interleave block data and
      metadata in a single buffer, has no provisioned storage, or has metadata
      but failed to register with blk integrity.
      
      The namespace has to be scanned in two phases to support separate
      metadata formats. The first establishes the sector size and capacity
      prior to invoking add_disk. If metadata is required, the capacity will
      be temporarily set to 0 until it can be revalidated and registered with
      the integrity extensions after add_disk completes.
      
      The driver relies on the integrity extensions to provide the metadata
      buffer. NVMe requires this be a single physically contiguous region,
      so only one integrity segment is allowed per command. If the metadata
      is used for T10 PI, the driver provides mappings to save and restore
      the reftag physical block translation. The driver provides no-op
      functions for generate and verify if metadata is not used for protection
      information. This way the setup is always provided by the block layer.
      
      If a request does not supply a required metadata buffer, the command
      is failed with bad address. This could only happen if a user manually
      disables verify/generate on such a disk. The only exception to where
      this is okay is if the controller is capable of stripping/generating
      the metadata, which is possible on some types of formats.
      
      The metadata scatter gather list now occupies the spot in the nvme_iod
      that used to be used to link retryable IOD's, but we don't do that
      anymore, so the field was unused.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      e1e5e564
    • David Vrabel's avatar
      x86: pte_protnone() and pmd_protnone() must check entry is not present · e3a1f6ca
      David Vrabel authored
      Since _PAGE_PROTNONE aliases _PAGE_GLOBAL it is only valid if
      _PAGE_PRESENT is clear.  Make pte_protnone() and pmd_protnone() check
      for this.
      
      This fixes a 64-bit Xen PV guest regression introduced by 8a0516ed
      ("mm: convert p[te|md]_numa users to p[te|md]_protnone_numa").  Any
      userspace process would endlessly fault.
      
      In a 64-bit PV guest, userspace page table entries have _PAGE_GLOBAL set
      by the hypervisor.  This meant that any fault on a present userspace
      entry (e.g., a write to a read-only mapping) would be misinterpreted as
      a NUMA hinting fault and the fault would not be correctly handled,
      resulting in the access endlessly faulting.
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3a1f6ca
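The aliasing problem in this commit can be reproduced with plain bit arithmetic. The sketch below is a userspace illustration, not the kernel helpers themselves: the macro and function names are hypothetical, and the bit positions (PRESENT in bit 0, GLOBAL in bit 8, PROTNONE sharing the GLOBAL bit) follow the aliasing the commit message describes.

```c
#include <stdint.h>

#define PAGE_PRESENT_SKETCH	(UINT64_C(1) << 0)
#define PAGE_GLOBAL_SKETCH	(UINT64_C(1) << 8)
#define PAGE_PROTNONE_SKETCH	PAGE_GLOBAL_SKETCH	/* shares the GLOBAL bit */

/* Unchecked form: any entry with the shared bit set looks like PROTNONE. */
static int protnone_unchecked(uint64_t flags)
{
	return (flags & PAGE_PROTNONE_SKETCH) != 0;
}

/* Checked form: the shared bit only means PROTNONE when PRESENT is clear. */
static int protnone_checked(uint64_t flags)
{
	return (flags & (PAGE_PROTNONE_SKETCH | PAGE_PRESENT_SKETCH))
		== PAGE_PROTNONE_SKETCH;
}
```

An entry that is both present and global, as in the Xen PV case, is classified as PROTNONE by the unchecked form and correctly rejected by the checked form.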
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 2b9fb532
      Linus Torvalds authored
      Pull btrfs updates from Chris Mason:
       "This pull is mostly cleanups and fixes:
      
         - The raid5/6 cleanups from Zhao Lei fixup some long standing warts
           in the code and add improvements on top of the scrubbing support
           from 3.19.
      
         - Josef has round one of our ENOSPC fixes coming from large btrfs
           clusters here at FB.
      
         - Dave Sterba continues a long series of cleanups (thanks Dave), and
           Filipe continues hammering on corner cases in fsync and others
      
        This all was held up a little trying to track down a use-after-free in
        btrfs raid5/6.  It's not clear yet if this is just made easier to
        trigger with this pull or if it's a new bug from the raid5/6 cleanups.
        Dave Sterba is the only one to trigger it so far, but he has a
        consistent way to reproduce, so we'll get it nailed shortly"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
        Btrfs: don't remove extents and xattrs when logging new names
        Btrfs: fix fsync data loss after adding hard link to inode
        Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
        Btrfs: account for large extents with enospc
        Btrfs: don't set and clear delalloc for O_DIRECT writes
        Btrfs: only adjust outstanding_extents when we do a short write
        btrfs: Fix out-of-space bug
        Btrfs: scrub, fix sleep in atomic context
        Btrfs: fix scheduler warning when syncing log
        Btrfs: Remove unnecessary placeholder in btrfs_err_code
        btrfs: cleanup init for list in free-space-cache
        btrfs: delete chunk allocation attemp when setting block group ro
        btrfs: clear bio reference after submit_one_bio()
        Btrfs: fix scrub race leading to use-after-free
        Btrfs: add missing cleanup on sysfs init failure
        Btrfs: fix race between transaction commit and empty block group removal
        btrfs: add more checks to btrfs_read_sys_array
        btrfs: cleanup, rename a few variables in btrfs_read_sys_array
        btrfs: add checks for sys_chunk_array sizes
        btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
        ...
      2b9fb532
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · 4533f6e2
      Linus Torvalds authored
      Pull Ceph changes from Sage Weil:
       "On the RBD side, there is a conversion to blk-mq from Christoph,
        several long-standing bug fixes from Ilya, and some cleanup from
        Rickard Strandqvist.
      
        On the CephFS side there is a long list of fixes from Zheng, including
        improved session handling, a few IO path fixes, some dcache management
        correctness fixes, and several blocking while !TASK_RUNNING fixes.
      
        The core code gets a few cleanups and Chaitanya has added support for
        TCP_NODELAY (which has been used on the server side for ages but we
        somehow missed on the kernel client).
      
        There is also an update to MAINTAINERS to fix up some email addresses
        and reflect that Ilya and Zheng are doing most of the maintenance for
        RBD and CephFS these days.  Do not be surprised to see a pull request
        come from one of them in the future if I am unavailable for some
        reason"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
        MAINTAINERS: update Ceph and RBD maintainers
        libceph: kfree() in put_osd() shouldn't depend on authorizer
        libceph: fix double __remove_osd() problem
        rbd: convert to blk-mq
        ceph: return error for traceless reply race
        ceph: fix dentry leaks
        ceph: re-send requests when MDS enters reconnecting stage
        ceph: show nocephx_require_signatures and notcp_nodelay options
        libceph: tcp_nodelay support
        rbd: do not treat standalone as flatten
        ceph: fix atomic_open snapdir
        ceph: properly mark empty directory as complete
        client: include kernel version in client metadata
        ceph: provide seperate {inode,file}_operations for snapdir
        ceph: fix request time stamp encoding
        ceph: fix reading inline data when i_size > PAGE_SIZE
        ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
        ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
        ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
        rbd: fix error paths in rbd_dev_refresh()
        ...
      4533f6e2
    • Linus Torvalds's avatar
      Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux · 89d3fa45
      Linus Torvalds authored
      Pull thermal management updates from Zhang Rui:
       "Specifics:
      
         - Abstract the code and introduce helper functions for all int340x
           thermal drivers.  From: Srinivas Pandruvada.
      
         - Reorganize the ACPI LPAT table support code so that it can be
           shared for both ACPI PMIC driver and int340x thermal driver.
      
         - Add support for Braswell in intel_soc_dts thermal driver.
      
         - a couple of small fixes/cleanups for step_wise governor and int340x
           thermal driver"
      
      * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
        Thermal/int340x_thermal: remove unused uuids.
        thermal: step_wise: spelling fixes
        thermal: int340x: fix sparse warning
        Thermal/int340x: LPAT conversion for temperature
        ACPI / PMIC: Use common LPAT table handling functions
        ACPI / LPAT: Common table processing functions
        thermal: Intel SoC DTS: Add Braswell support
        Thermal/int340x/int3402: Provide notification support
        Thermal/int340x/processor_thermal: Add thermal zone support
        Thermal/int340x/int3403: Use int340x thermal API
        Thermal/int340x/int3402: Use int340x thermal API
        Thermal/int340x: Add common thermal zone handler
      89d3fa45
    • Linus Torvalds's avatar
      Merge tag 'edac_fixes_for_3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp · 477ea116
      Linus Torvalds authored
      Pull two EDAC fixes from Borislav Petkov:
      
       - A fix to sb_edac for proper detection on SNB machines
      
       - A fix to amd64_edac to not explode on Numascale machines with more
         than 16 memory controllers, from Daniel J Blueman.
      
      * tag 'edac_fixes_for_3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
        EDAC, amd64_edac: Prevent OOPS with >16 memory controllers
        sb_edac: Fix detection on SNB machines
      477ea116
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v3.20-1' of... · 6ed3e57f
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v3.20-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull platform driver update from Darren Hart:
       "This includes a significant update to the toshiba_acpi driver,
        bringing it to feature parity with the Windows driver, followed by
        some needed cleanups.
      
        The other changes are mostly minor updates, quirks, sparse fixes, or
        cleanups.
      
        Details:
      
         - toshiba_acpi:
             Add support for missing features from the Windows driver, bump the
             sysfs version, and clean up the driver.
      
         - thinkpad_acpi:
             BIOS string versions, unhandled hkey events.
      
         - samsung-laptop:
             Add native backlight quirk, enable better lid handling.
      
         - intel_scu_ipc:
             Read resources from PCI configuration
      
         - other:
             Fix sparse warnings, general cleanups"
      
      * tag 'platform-drivers-x86-v3.20-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: (34 commits)
        toshiba_acpi: Cleanup GPL header
        toshiba_acpi: Cleanup comment blocks and capitalization
        toshiba_acpi: Make use of DEVICE_ATTR_{RO, RW} macros
        toshiba_acpi: Drop the toshiba_ prefix from sysfs function names
        toshiba_acpi: Move sysfs function and struct declarations further down
        Documentation/ABI: Add file describing the sysfs entries for toshiba_acpi
        toshiba_acpi: Clean file according to coding style
        toshiba_acpi: Bump version number to 0.21
        toshiba_acpi: Add support to enable/disable USB 3
        toshiba_acpi: Add support for Panel Power ON
        toshiba_acpi: Add support for Keyboard functions mode
        toshiba_acpi: Add fan entry to sysfs
        toshiba_acpi: Add version entry to sysfs
        thinkpad_acpi: support new BIOS version string pattern
        thinkpad_acpi: unhandled hkey event
        toshiba_acpi: Make toshiba_eco_mode_available more robust
        classmate-laptop: Fix sparse warning (0 as NULL)
        Sony-laptop: Fix sparse warning (make undeclared var static)
        thinkpad_acpi.c: Fix sparse warning (make undeclared var static)
        samsung-laptop.c: Prefer kstrtoint over single variable sscanf
        ...
      6ed3e57f
    • Linus Torvalds's avatar
      Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · b11a2783
      Linus Torvalds authored
      Pull kconfig updates from Michal Marek:
       "Yann E Morin was supposed to take over kconfig maintainership, but
        this hasn't happened.  So I'm sending a few kconfig patches that I
        collected:
      
         - Fix for missing va_end in kconfig
         - merge_config.sh displays usage if given too few arguments
         - s/boolean/bool/ in Kconfig files for consistency, with the plan to
           only support bool in the future"
      
      * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
        kconfig: use va_end to match corresponding va_start
        merge_config.sh: Display usage if given too few arguments
        kconfig: use bool instead of boolean for type definition attributes
      b11a2783
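The va_end fix listed in this pull follows a general C rule: each va_start must be matched by a va_end in the same function. A minimal sketch of the pairing, using a hypothetical helper `format_msg()` that is not taken from kconfig:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Format into buf; the va_start is paired with a va_end before returning. */
static int format_msg(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);	/* required once va_start has been invoked */
	return n;
}
```

Omitting the va_end often appears to work on common ABIs, but the behaviour is undefined by the C standard, which is why the omission is treated as a bug.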
    • Linus Torvalds's avatar
      Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · 77343343
      Linus Torvalds authored
      Pull misc kbuild changes from Michal Marek:
       "Just a few non-critical kbuild changes:
      
         - builddeb adds the actual distribution name in the changelog
         - documentation fixes"
      
      * 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
        kbuild: trivial - fix the help doc of CONFIG_CC_OPTIMIZE_FOR_SIZE
        kbuild: Update documentation of clean-files and clean-dirs
        builddeb: Try to determine distribution
        builddeb: Update year and git repository URL in debian/copyright
      77343343
    • Sage Weil's avatar
      MAINTAINERS: update Ceph and RBD maintainers · 0f5417ce
      Sage Weil authored
      - add Ilya, drop Yehuda as an RBD maintainer
      - add Zheng as a Ceph maintainer
      - update Yehuda and Sage's emails
      Signed-off-by: Sage Weil <sage@redhat.com>
      0f5417ce
    • Linus Torvalds's avatar
      Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · 27a22ee4
      Linus Torvalds authored
      Pull kbuild updates from Michal Marek:
      
       - several cleanups in kbuild
      
       - serialize multiple *config targets so that 'make defconfig kvmconfig'
         works
      
       - The cc-ifversion macro got support for an else-branch
      
      * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
        kbuild,gcov: simplify kernel/gcov/Makefile more
        kbuild: allow cc-ifversion to have the argument for false condition
        kbuild,gcov: simplify kernel/gcov/Makefile
        kbuild,gcov: remove unnecessary workaround
        kbuild: do not add $(call ...) to invoke cc-version or cc-fullversion
        kbuild: fix cc-ifversion macro
        kbuild: drop $(version_h) from MRPROPER_FILES
        kbuild: use mixed-targets when two or more config targets are given
        kbuild: remove redundant line from bounds.h/asm-offsets.h
        kbuild: merge bounds.h and asm-offsets.h rules
        kbuild: Drop support for clean-rule
      27a22ee4
    • Ilya Dryomov's avatar
      libceph: kfree() in put_osd() shouldn't depend on authorizer · b28ec2f3
      Ilya Dryomov authored
      a255651d ("ceph: ensure auth ops are defined before use") made
      kfree() in put_osd() conditional on the authorizer.  A mechanical
      mistake most likely - fix it.
      
      Cc: Alex Elder <elder@linaro.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
      b28ec2f3
    • Ilya Dryomov's avatar
      libceph: fix double __remove_osd() problem · 7eb71e03
      Ilya Dryomov authored
      It turns out it's possible to get __remove_osd() called twice on the
      same OSD.  That doesn't sit well with rb_erase() - depending on the
      shape of the tree we can get a NULL dereference, a soft lockup or
      a random crash at some point in the future as we end up touching freed
      memory.  One scenario that I was able to reproduce is as follows:
      
                  <osd3 is idle, on the osd lru list>
      <con reset - osd3>
      con_fault_finish()
        osd_reset()
                                    <osdmap - osd3 down>
                                    ceph_osdc_handle_map()
                                      <takes map_sem>
                                      kick_requests()
                                        <takes request_mutex>
                                        reset_changed_osds()
                                          __reset_osd()
                                            __remove_osd()
                                        <releases request_mutex>
                                      <releases map_sem>
          <takes map_sem>
          <takes request_mutex>
          __kick_osd_requests()
            __reset_osd()
              __remove_osd() <-- !!!
      
      A case can be made that osd refcounting is imperfect and reworking it
      would be a proper resolution, but for now Sage and I decided to fix
      this by adding a safe guard around __remove_osd().
      
      Fixes: http://tracker.ceph.com/issues/8087
      
      Cc: Sage Weil <sage@redhat.com>
      Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc5: libceph: assert both regular and lingering lists in __remove_osd()
      Cc: stable@vger.kernel.org # 3.9+: cc9f1f51: libceph: change from BUG to WARN for __remove_osd() asserts
      Cc: stable@vger.kernel.org # 3.9+
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
      7eb71e03
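The idea of guarding a removal so that a second call becomes harmless can be sketched with a small linked list. This is an analogy under assumptions, not the libceph change: the real code removes OSDs from an rbtree, while this sketch uses a circular doubly linked list and hypothetical names (`node_sketch`, `node_remove_once()`).

```c
#include <stddef.h>

struct node_sketch {
	struct node_sketch *prev, *next;
};

/* A node that points at itself is not on any list. */
static void node_init(struct node_sketch *n)
{
	n->prev = n->next = n;
}

static int node_linked(const struct node_sketch *n)
{
	return n->next != n;
}

static void node_insert_after(struct node_sketch *head, struct node_sketch *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Guarded removal: returns 1 if unlinked now, 0 if it was already removed. */
static int node_remove_once(struct node_sketch *n)
{
	if (!node_linked(n))
		return 0;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);	/* mark as unlinked so a second call is a no-op */
	return 1;
}
```

Without the guard, the second removal would rewrite neighbours through stale pointers, which corresponds to the crashes the commit message lists.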
    • Christoph Hellwig's avatar
      rbd: convert to blk-mq · 7ad18afa
      Christoph Hellwig authored
      This converts the rbd driver to use the blk-mq infrastructure.  Except
      for switching to a per-request work item this is almost mechanical.
      
      This was tested by Alexandre DERUMIER in November, and found to give
      him 120000 iops, although the only comparison available was an old
      3.10 kernel which gave 80000 iops.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarAlex Elder <elder@linaro.org>
      [idryomov@gmail.com: context, blk_mq_init_queue() EH]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      7ad18afa
    • Yan, Zheng's avatar
      ceph: return error for traceless reply race · 4d41cef2
      Yan, Zheng authored
      When we receive a traceless reply for a request that created a new
      inode, we re-send a lookup request to the MDS to get information about
      the newly created inode. (The VFS expects the FS callback to return an
      inode in the create case.) This breaks one request into two requests.
      Another client may modify or move the new inode in between.
      
      When the race happens, ceph_handle_notrace_create() unconditionally
      links the dentry for 'create' operation to the inode returned by lookup.
      This may confuse VFS when the inode is a directory (VFS does not allow
      multiple linkages for directory inode).
      
      This patch makes ceph_handle_notrace_create() return an error when it
      detects a race. This event should be rare and it happens only when we
      talk to an old MDS. Recent MDS versions do not send a traceless reply
      for a request that creates a new inode.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      4d41cef2
    • Yan, Zheng's avatar
      ceph: fix dentry leaks · 5cba372c
      Yan, Zheng authored
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      5cba372c
    • Yan, Zheng's avatar
      ceph: re-send requests when MDS enters reconnecting stage · 3de22be6
      Yan, Zheng authored
      So that MDS can check if any request is already completed and process
      completed requests in clientreplay stage. When completed requests are
      processed in clientreplay stage, MDS can avoid sending traceless
      replies.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      3de22be6
    • Chaitanya Huilgol's avatar
      libceph: tcp_nodelay support · ba988f87
      Chaitanya Huilgol authored
      Set the TCP_NODELAY socket option on connection sockets. This disables
      Nagle’s algorithm and improves latency characteristics. The
      tcp_nodelay (default) and notcp_nodelay option flags are provided to
      enable or disable setting the socket option.
      Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com>
      [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments]
      Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
      ba988f87
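The effect of the option described above can be sketched in user space. This is only an illustration, not the libceph code: the helper names `set_tcp_nodelay()` and `get_tcp_nodelay()` are hypothetical, and the kernel client sets the option on its own in-kernel sockets rather than through these calls.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: enable (non-zero) or disable (zero) TCP_NODELAY
 * on a TCP socket. Returns 0 on success, -1 on error (errno is set).
 */
int set_tcp_nodelay(int fd, int enable)
{
    int val = enable ? 1 : 0;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val));
}

/*
 * Hypothetical helper: read the option back.
 * Returns 1 if Nagle's algorithm is disabled, 0 if not, -1 on error.
 */
int get_tcp_nodelay(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    assert(fd >= 0);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

With the option set, small writes are sent immediately instead of being coalesced, which trades some bandwidth efficiency for lower latency.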
    • Ilya Dryomov's avatar
      rbd: do not treat standalone as flatten · cf32bd9c
      Ilya Dryomov authored
      If the clone is resized down to 0, it becomes standalone.  If such a
      resize is carried out while an image is mapped, we detect this and
      call rbd_dev_parent_put(), which means "let go of all parent state,
      including the spec(s) of parent image(s)".  This leads to a mismatch
      between "rbd info" and sysfs parent fields, so a fix is in order.
      
          # rbd create --image-format 2 --size 1 foo
          # rbd snap create foo@snap
          # rbd snap protect foo@snap
          # rbd clone foo@snap bar
          # DEV=$(rbd map bar)
          # rbd resize --allow-shrink --size 0 bar
          # rbd resize --size 1 bar
          # rbd info bar | grep parent
                  parent: rbd/foo@snap
      
      Before:
      
          # cat /sys/bus/rbd/devices/0/parent
          (no parent image)
      
      After:
      
          # cat /sys/bus/rbd/devices/0/parent
          pool_id 0
          pool_name rbd
          image_id 10056b8b4567
          image_name foo
          snap_id 2
          snap_name snap
          overlap 0
      Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
      Reviewed-by: Josh Durgin <jdurgin@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
      cf32bd9c
    • Yan, Zheng's avatar
      ceph: fix atomic_open snapdir · bf91c315
      Yan, Zheng authored
      ceph_handle_snapdir() checks ceph_mdsc_do_request()'s return value
      and creates the snapdir inode if it's -ENOENT.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      bf91c315
    • Yan, Zheng's avatar
      ceph: properly mark empty directory as complete · 2f92b3d0
      Yan, Zheng authored
      ceph_add_cap() calls __check_cap_issue(), which clears the directory
      inode's complete flag. So the complete flag for an empty directory
      should be set after calling ceph_add_cap().
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      2f92b3d0
    • Yan, Zheng's avatar
      client: include kernel version in client metadata · a6a5ce4f
      Yan, Zheng authored
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      a6a5ce4f
    • Yan, Zheng's avatar
      ceph: provide separate {inode,file}_operations for snapdir · 38c48b5f
      Yan, Zheng authored
      Remove all unsupported operations from {inode,file}_operations.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      38c48b5f
    • Yan, Zheng's avatar
      ceph: fix request time stamp encoding · 1f041a89
      Yan, Zheng authored
      struct timespec uses 'long' to represent seconds and nanoseconds. 'long'
      is 64 bits on 64-bit machines. The ceph MDS expects the time stamp to be
      encoded as struct ceph_timespec, which uses 'u32' to represent seconds
      and nanoseconds.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      1f041a89
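The width mismatch described above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel patch: `struct wire_timespec` and `encode_timespec()` are hypothetical stand-ins for a fixed-width on-wire timestamp, and byte-order conversion is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical stand-in for a fixed-width on-wire timestamp:
 * two 32-bit fields, 8 bytes in total on every architecture.
 */
struct wire_timespec {
    uint32_t tv_sec;
    uint32_t tv_nsec;
};

/*
 * Convert field by field instead of copying the raw struct timespec,
 * whose members may be 64 bits wide on 64-bit machines.
 * Byte-order conversion is omitted in this sketch.
 */
void encode_timespec(struct wire_timespec *dst, const struct timespec *src)
{
    assert(dst != NULL && src != NULL);
    dst->tv_sec = (uint32_t)src->tv_sec;
    dst->tv_nsec = (uint32_t)src->tv_nsec;
}
```

Copying the raw bytes of a `struct timespec` into a message would place 16 bytes where the receiver expects 8 on a typical 64-bit system, which is why an explicit per-field conversion is used here.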
    • Yan, Zheng's avatar
      ceph: fix reading inline data when i_size > PAGE_SIZE · fcc02d2a
      Yan, Zheng authored
      When an inode has inline data but its size > PAGE_SIZE (it was truncated
      to a larger size), the previous direct read code returned -EIO. This patch
      adds code to return zeros for data whose offset > PAGE_SIZE.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      fcc02d2a
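The read behaviour described above can be sketched as a plain buffer operation. This is a simplified user-space illustration, not the ceph code: `read_inline()` and its parameters are hypothetical, and it assumes the stored inline bytes never exceed the logical file size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: read up to len bytes at offset off from a file
 * whose stored inline data is data[0..data_len) but whose logical size
 * is file_size (it may have been truncated to a larger size).
 * Bytes past the stored data read back as zeros, not as an error.
 * Returns the number of bytes written to out.
 */
size_t read_inline(const unsigned char *data, size_t data_len,
                   size_t file_size, size_t off,
                   unsigned char *out, size_t len)
{
    size_t copied = 0;

    assert(data_len <= file_size);

    if (off >= file_size)
        return 0;                     /* read starts at or past EOF */
    if (len > file_size - off)
        len = file_size - off;        /* clamp the read to EOF */

    if (off < data_len) {
        copied = data_len - off;      /* real bytes available from off */
        if (copied > len)
            copied = len;
        memcpy(out, data + off, copied);
    }
    memset(out + copied, 0, len - copied);  /* zero-fill the hole */
    return len;
}
```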
    • Yan, Zheng's avatar
      ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) · 86d8f67b
      Yan, Zheng authored
      Use an atomic variable to track the number of sessions. This avoids
      blocking operations inside wait loops.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      86d8f67b
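The idea of a non-blocking wait condition can be sketched with C11 atomics. This is a user-space illustration only: `struct session_tracker` and the helper names are hypothetical, and the kernel uses its own atomic and wait-queue primitives rather than `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical tracker: a single atomic counter of open sessions. */
struct session_tracker {
    atomic_int nr_sessions;
};

void tracker_init(struct session_tracker *t)
{
    atomic_init(&t->nr_sessions, 0);
}

void session_opened(struct session_tracker *t)
{
    atomic_fetch_add(&t->nr_sessions, 1);
}

void session_closed(struct session_tracker *t)
{
    int prev = atomic_fetch_sub(&t->nr_sessions, 1);

    assert(prev > 0);   /* closing more sessions than were opened is a bug */
    (void)prev;
}

/*
 * Condition suitable for a wait loop: one atomic load, no locks taken,
 * so evaluating it can never put the caller to sleep.
 */
bool all_sessions_closed(struct session_tracker *t)
{
    return atomic_load(&t->nr_sessions) == 0;
}
```

A condition that walks a lock-protected list would need to take a mutex and could block. A counter reduces the check to a single load.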
    • Yan, Zheng's avatar
      ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) · c4d4a582
      Yan, Zheng authored
      We should not do blocking operations in wait_event_interruptible()'s
      condition check function, but reading inline data can block. So move
      the inline data read code to ceph_get_caps().
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      c4d4a582
    • Yan, Zheng's avatar
      ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) · d3383a8e
      Yan, Zheng authored
      check_cap_flush() calls mutex_lock(), which may block. So we can't
      use it as the condition check function for wait_event().
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      d3383a8e
    • Ilya Dryomov's avatar
      rbd: fix error paths in rbd_dev_refresh() · 73e39e4d
      Ilya Dryomov authored
      header_rwsem should be released on errors.  Also remove useless
      rbd_dev->mapping.size != rbd_dev->header.image_size test.
      Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
      73e39e4d