1. 31 Jul, 2012 9 commits
    • raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer · 3f9e7c14
      majianpeng authored
      Because bios can be merged at the block layer, a bio error may be caused by
      another bio that was merged into the same request.
      With this flag set, the exact error sector can be found without redundant
      operations such as re-write and re-read.
      
      V0->V1: Use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging at the
      block layer.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: prevent merging too large request · 12cee5a8
      Shaohua Li authored
      For an SSD, once the request size exceeds a specific value (the optimal io
      size), making it larger no longer matters for bandwidth. In that situation, if
      making the request bigger leaves some disks idle, total throughput actually
      drops. A good example is doing a readahead in a two-disk raid1 setup.
      
      So when should we split big requests? We absolutely don't want to split a big
      request into very small requests. Even on an SSD, a big request transfer is more
      efficient. This patch only considers requests with size above the optimal io size.
      
      If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
      two requests 32k and two disks. We can let each disk run one 32k request, or
      split the requests to 4 16k requests and each disk runs two. It's hard to say
      which case is better, depending on hardware.
      
      So only consider the case where there are idle disks. For readahead, a split is
      always better in this case. In my test, the patch below improves throughput by
      more than 30%. Hmm, not 100%, because the disk isn't 100% busy.
      
      Such case can happen not just in readahead, for example, in directio. But I
      suppose directio usually will have bigger IO depth and make all disks busy, so
      I ignored it.
      
      Note: if the raid uses any hard disk, we don't prevent merging. That would make
      performance worse.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
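The decision described in this commit message can be sketched in user-space C. This is only an illustration under stated assumptions, not the patch's code: `struct disk_state`, its fields and `should_limit_merge()` are hypothetical names, and "idle" is simplified to "no pending requests".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-disk state; these fields are illustrative, not the kernel's. */
struct disk_state {
    bool rotational;      /* true for a hard disk, false for an SSD */
    unsigned int pending; /* requests currently in flight on this disk */
};

/*
 * Return true when a request that has grown to `len` bytes should stop
 * merging, so that the following I/O can be sent to another, idle disk.
 */
bool should_limit_merge(const struct disk_state *disks, size_t ndisks,
                        size_t chosen, size_t len, size_t opt_io_size)
{
    /* Requests below the optimal I/O size are never limited. */
    if (opt_io_size == 0 || len < opt_io_size)
        return false;

    /* If the array contains any hard disk, keep merging as before. */
    for (size_t i = 0; i < ndisks; i++)
        if (disks[i].rotational)
            return false;

    /* Only limit merging when some other disk is idle. */
    for (size_t i = 0; i < ndisks; i++)
        if (i != chosen && disks[i].pending == 0)
            return true;

    /* All disks busy: the benefit depends on hardware, so do nothing. */
    return false;
}
```

With two SSDs, an optimal io size of 16k and a 32k request on disk 0, the function returns true only while disk 1 has nothing in flight.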
    • md/raid1: read balance chooses idlest disk for SSD · 9dedf603
      Shaohua Li authored
      An SSD has no spindle, so the distance between requests means nothing. And the
      original distance-based algorithm can sometimes cause severe performance issues
      for SSD raid.
      
      Consider two thread groups, one accessing file A, the other accessing file B.
      The first group will access one disk and the second will access the other disk,
      because requests are near within a group and far between groups. In this case,
      read balance might keep one disk very busy but the other relatively idle. For
      SSD, we should try our best to distribute requests to as many disks as possible.
      There isn't a spindle move penalty anyway.
      
      With below patch, I can see more than 50% throughput improvement sometimes
      depending on workloads.
      
      The only exception is small requests that can be merged into a big request,
      which typically can drive higher throughput for SSD too. Such small requests are
      sequential reads. Unlike on a hard disk, sequential reads which can't be merged
      (for example direct IO, or reads without readahead) can be ignored for SSD. Again
      there is no spindle move penalty. Readahead dispatches small requests and such
      requests can be merged.
      
      The previous patch helps detect sequential reads well, at least if the number of
      concurrent reads isn't greater than the number of raid disks. In that case, the
      distance-based algorithm doesn't work well either.
      
      V2: For mixed hard disk and SSD raid, don't use the distance-based algorithm for
      random IO either. This makes the algorithm generic for raid with SSD.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
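A minimal user-space sketch of the selection policy described above, assuming a simplified model: `struct mirror`, its fields and `choose_read_disk()` are hypothetical names, and "idlest" is taken to mean "fewest pending requests". It is not the kernel's read_balance().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-mirror state for illustration only. */
struct mirror {
    bool rotational;         /* true for a hard disk */
    unsigned int pending;    /* requests in flight */
    uint64_t head_position;  /* sector where the last request ended */
    uint64_t next_seq_sect;  /* sector a sequential reader would ask for next */
};

static uint64_t sector_distance(uint64_t a, uint64_t b)
{
    return a > b ? a - b : b - a;
}

/* Pick the mirror that should serve a read starting at `sector`. */
size_t choose_read_disk(const struct mirror *m, size_t n, uint64_t sector)
{
    bool any_ssd = false;
    size_t best = 0;

    for (size_t i = 0; i < n; i++) {
        /* Sequential reads stay on the same disk so they can be merged. */
        if (m[i].next_seq_sect == sector)
            return i;
        if (!m[i].rotational)
            any_ssd = true;
    }

    for (size_t i = 1; i < n; i++) {
        if (any_ssd) {
            /* Array contains an SSD: prefer the disk with the least work. */
            if (m[i].pending < m[best].pending)
                best = i;
        } else if (sector_distance(m[i].head_position, sector) <
                   sector_distance(m[best].head_position, sector)) {
            /* Hard disks only: fall back to the nearest head position. */
            best = i;
        }
    }
    return best;
}
```

The `any_ssd` test follows the V2 note: once the array contains an SSD, random reads ignore distance even if other members are hard disks.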
    • md/raid1: make sequential read detection per disk based · be4d3280
      Shaohua Li authored
      Currently the sequential read detection is global. It's natural to make it
      per-disk, which improves the detection for multiple concurrent sequential
      reads. The next patch will make SSD read balance not use the distance-based
      algorithm, where this change helps detect truly sequential reads for SSD.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
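Why per-disk tracking helps can be seen in a small sketch. The names below (`struct seq_tracker`, `record_read()`, `is_sequential()`) are hypothetical, and the model assumes each disk simply remembers the sector that would follow its last read.

```c
#include <stdbool.h>
#include <stdint.h>

#define NDISKS 2

/* One "next expected sector" per disk instead of a single global value. */
struct seq_tracker {
    uint64_t next_seq_sect[NDISKS];
};

/* Remember where a sequential reader on `disk` would continue. */
void record_read(struct seq_tracker *t, int disk, uint64_t sector,
                 uint32_t nr_sectors)
{
    t->next_seq_sect[disk] = sector + nr_sectors;
}

/* A read is sequential for `disk` if it starts where the last one ended. */
bool is_sequential(const struct seq_tracker *t, int disk, uint64_t sector)
{
    return t->next_seq_sect[disk] == sector;
}
```

With two interleaved streams, one per disk, both keep being recognised as sequential. A single global value would be overwritten by whichever stream issued the most recent read.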
    • MD RAID10: Export md_raid10_congested · cc4d1efd
      Jonathan Brassow authored
      md/raid10: Export is_congested test.
      
      In similar fashion to commits
      	11d8a6e3
      	1ed7242e
      we export the RAID10 congestion checking function so that dm-raid.c can
      make use of it and make use of the personality.  The 'queue' and 'gendisk'
      structures will not be available to the MD code when device-mapper sets
      up the device, so we conditionalize access to these fields also.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD: Move macros from raid1*.h to raid1*.c · 473e87ce
      Jonathan Brassow authored
      MD RAID1/RAID10: Move some macros from .h file to .c file
      
      There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are defined
      in both raid1.h and raid10.h.  They are only used in their respective .c files.
      However, if we wish to make RAID10 accessible to the device-mapper RAID
      target (dm-raid.c), then we need to move these macros into the .c files where
      they are used so that they do not conflict with each other.
      
      The macros from the two files are identical and could be moved into md.h, but
      I chose to leave the duplication and have them remain in the personality
      files.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID1: rename mirror_info structure · 0eaf822c
      Jonathan Brassow authored
      MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
      
      The same structure name ('mirror_info') is used by raid10.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.  While only one of these structure
      names needs to change, this patch adds consistency to the naming of the
      structure.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: rename mirror_info structure · dc280d98
      Jonathan Brassow authored
      MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'
      
      The same structure name ('mirror_info') is used by raid1.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: Fix compiler warning. · 3bbae04b
      Jonathan Brassow authored
      MD RAID10:  Fix compiler warning.
      
      Initialize variable to prevent compiler warning.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 19 Jul, 2012 7 commits
    • raid5: add a per-stripe lock · b17459c0
      Shaohua Li authored
      Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
      lock contention of conf->device_lock.
      
      stripe ->toread and ->towrite are protected by the per-stripe lock.  Access to
      the bio lists of the stripe is always serialized by this lock, so adding a bio
      to the lists (add_stripe_bio()) and removing a bio from the lists (like
      ops_run_biofill()) do not race.
      
      If the bios in the ->read, ->written ... lists are not shared by multiple
      stripes, we don't need any lock to protect ->read, ->written, because
      STRIPE_ACTIVE will protect them. If the bios are shared, there are two
      protections:
      1. bi_phys_segments acts as a reference count
      2. list traversal uses r5_next_bio, which makes the traversal never access a bio
      not belonging to the stripe
      
      Let's have an example:
      |  stripe1 |  stripe2    |  stripe3  |
      ...bio1......|bio2|bio3|....bio4.....
      
      stripe2 has 4 bios. When it's finished, it will decrement bi_phys_segments for
      all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
      bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
      because of the r5_next_bio check. Then stripe1 will end_bio for bio1 and
      stripe3 will end_bio for bio4.
      
      Before add_stripe_bio() adds a bio to a stripe, we have already incremented the
      bio's bi_phys_segments, so there is no need to worry about other stripes
      releasing the bio.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
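The stripe1/stripe2/stripe3 example can be replayed in a user-space model. This is a sketch under assumptions: `struct mock_bio`, `next_bio_in_stripe()` and `finish_stripe()` are hypothetical stand-ins, `active_stripes` plays the role of bi_phys_segments, and a stripe is assumed to cover 8 sectors.

```c
#include <stddef.h>
#include <stdint.h>

#define STRIPE_SECTORS 8 /* assumed stripe width for this illustration */

struct mock_bio {
    uint64_t sector;        /* first sector covered by the bio */
    uint32_t nr_sectors;    /* length in sectors */
    int active_stripes;     /* stands in for bi_phys_segments */
    int ended;              /* set when the last stripe drops its reference */
    struct mock_bio *next;  /* shared chain pointer, like bi_next */
};

/* Follow the chain only while the bio ends inside the current stripe. */
struct mock_bio *next_bio_in_stripe(struct mock_bio *bio, uint64_t stripe_start)
{
    if (bio->sector + bio->nr_sectors < stripe_start + STRIPE_SECTORS)
        return bio->next;
    return NULL;
}

/* Drop this stripe's reference on each of its bios; end those that reach 0. */
void finish_stripe(struct mock_bio *head, uint64_t stripe_start)
{
    struct mock_bio *bio = head;

    while (bio && bio->sector < stripe_start + STRIPE_SECTORS) {
        /* Fetch the successor first: the bio may be gone after it ends. */
        struct mock_bio *next = next_bio_in_stripe(bio, stripe_start);

        if (--bio->active_stripes == 0)
            bio->ended = 1;
        bio = next;
    }
}
```

Finishing stripe2 first ends only bio2 and bio3. Finishing stripe1 afterwards stops at bio1 even though bio1's chain pointer still leads to bio2, because bio1 extends past the end of stripe1.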
    • raid5: remove unnecessary bitmap write optimization · 7eaf7e8e
      Shaohua Li authored
      Neil pointed out that the bitmap write optimization in handle_stripe_clean_event()
      is unnecessary, because one stripe rarely gets written twice in the
      meantime. We can always do a bitmap_startwrite when a write request is
      added to a stripe and a bitmap_endwrite after the write request is done.  Delete
      the optimization. With it gone, we can delete some uses of device_lock.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: lockless access raid5 overrided bi_phys_segments · e7836bd6
      Shaohua Li authored
      Raid5 overrides bio->bi_phys_segments and accesses it with device_lock held,
      which is unnecessary. We can actually make the access lockless.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: reduce chance release_stripe() taking device_lock · 4eb788df
      Shaohua Li authored
      release_stripe() is a place where conf->device_lock is heavily contended. We
      take the lock even when the stripe count isn't 1, which isn't required.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
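One common way to avoid taking a lock unless the last reference is being dropped is a "decrement and lock" helper. The sketch below shows that idea in user-space C11 with a pthread mutex. `dec_and_lock()` is a hypothetical name, and the commit message does not say whether the patch is written exactly this way.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop one reference. Returns true, with `lock` held, only when the count
 * reached zero; the caller then does the final release work and unlocks.
 */
bool dec_and_lock(atomic_int *count, pthread_mutex_t *lock)
{
    int old = atomic_load(count);

    /* Fast path: while the count is above 1, decrement without the lock. */
    while (old > 1) {
        /* On failure, `old` is reloaded with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return false;
    }

    /* Slow path: this may be the last reference, so take the lock first. */
    pthread_mutex_lock(lock);
    if (atomic_fetch_sub(count, 1) == 1)
        return true;
    pthread_mutex_unlock(lock);
    return false;
}
```

Callers that are not dropping the last reference never touch the mutex, which is where the reduction in contention would come from.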
    • md/raid1: close some possible races on write errors during resync · 58e94ae1
      NeilBrown authored
      commit 4367af55
         md/raid1: clear bad-block record when write succeeds.
      
      Added a 'reschedule_retry' call possibility at the end of
      end_sync_write, but didn't add matching code at the end of
      sync_request_write.  So if the writes complete very quickly, or
      scheduling makes it seem that way, then we can miss rescheduling
      the request and the resync could hang.
      
      Also commit 73d5c38a
          md: avoid races when stopping resync.
      
      Fixed a race condition in this same code in end_sync_write but didn't
      make the change in sync_request_write.
      
      This patch updates sync_request_write to fix both of those.
      Patch is suitable for 3.1 and later kernels.
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: avoid crash when stopping md array races with closing other open fds. · a05b7ea0
      NeilBrown authored
      md will refuse to stop an array if any other fd (or mounted fs) is
      using it.
      When any fs is unmounted or when the last open fd is closed all
      pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
      so there will be no pending IO to worry about when the array is
      stopped.
      
      However in order to send the STOP_ARRAY ioctl to stop the array one
      must first get an open fd on the block device.
      If some fd is being used to write to the block device and it is closed
      after mdadm opens the block device, but before mdadm issues the
      STOP_ARRAY ioctl, then there will be no last-close on the md device so
      __blkdev_put will not call sync_blockdev.
      
      If this happens, then IO can still be in-flight while md tears down
      the array and bad things can happen (use-after-free and subsequent
      havoc).
      
      So in the case where do_md_stop is being called from an open file
      descriptor, call sync_blockdev after taking the mutex to ensure there
      will be no new openers.
      
      This is needed when setting a read-write device to read-only too.
      
      Cc: stable@vger.kernel.org
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix bug in handling of new_data_offset · 25f7fd47
      NeilBrown authored
      commit c6563a8c
          md: add possibility to change data-offset for devices.
      
      introduced a 'new_data_offset' attribute which should normally
      be the same as 'data_offset', but can be explicitly set to a different
      value to allow a reshape operation to move the data.
      
      Unfortunately when the 'data_offset' is explicitly set through
      sysfs, the new_data_offset is not also set, so the two would become
      out-of-sync incorrectly.
      
      One result of this is that trying to set the 'size' after the
      'data_offset' would fail because it is not permitted to set the size
      when the 'data_offset' and 'new_data_offset' are different - as that
      can be confusing.
      Consequently when mdadm tried to do this while assembling an IMSM
      array it would fail.
      
      This bug was introduced in 3.5-rc1.
      Reported-by: Brian Downing <bdowning@lavos.net>
      Bisected-by: Brian Downing <bdowning@lavos.net>
      Tested-by: Brian Downing <bdowning@lavos.net>
      Signed-off-by: NeilBrown <neilb@suse.de>
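The invariant behind this fix can be illustrated with a small model. Everything here is hypothetical (`struct mock_rdev`, `set_data_offset()`, `set_size()`). It only mirrors the rule from the message: setting 'data_offset' explicitly must also update 'new_data_offset', otherwise a later size change is refused.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device state described above. */
struct mock_rdev {
    uint64_t data_offset;
    uint64_t new_data_offset;
    uint64_t sectors;
};

/* Explicitly setting data_offset keeps new_data_offset in sync. */
void set_data_offset(struct mock_rdev *rdev, uint64_t offset)
{
    rdev->data_offset = offset;
    rdev->new_data_offset = offset;
}

/* The size may only change while both offsets agree. */
int set_size(struct mock_rdev *rdev, uint64_t sectors)
{
    if (rdev->data_offset != rdev->new_data_offset)
        return -EINVAL;
    rdev->sectors = sectors;
    return 0;
}
```

Without the second assignment in `set_data_offset()`, the sequence "set offset, then set size" would fail in this model, which matches the symptom reported when assembling an IMSM array.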
  3. 14 Jul, 2012 11 commits
  4. 13 Jul, 2012 13 commits
    • Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d55e5bd0
      Linus Torvalds authored
      Pull the leap second fixes from Thomas Gleixner:
       "It's a rather large series, but well discussed, refined and reviewed.
        It got a massive testing by John, Prarit and tip.
      
        In theory we could split it into two parts.  The first two patches
      
          f55a6faa: hrtimer: Provide clock_was_set_delayed()
          4873fa07: timekeeping: Fix leapsecond triggered load spike issue
      
        are merely preventing the stuff loops forever issues, which people
        have observed.
      
        But there is no point in delaying the other 4 commits which achieve
        full correctness into 3.6 as they are tagged for stable anyway.  And I
        rather prefer to have the full fixes merged in bulk than a "prevent
        the observable wreckage and deal with the hidden fallout later"
        approach."
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        hrtimer: Update hrtimer base offsets each hrtimer_interrupt
        timekeeping: Provide hrtimer update function
        hrtimers: Move lock held region in hrtimer_interrupt()
        timekeeping: Maintain ktime_t based offsets for hrtimers
        timekeeping: Fix leapsecond triggered load spike issue
        hrtimer: Provide clock_was_set_delayed()
    • x86/vsyscall: allow seccomp filter in vsyscall=emulate · 5651721e
      Will Drewry authored
      If a seccomp filter program is installed, older static binaries and
      distributions with older libc implementations (glibc 2.13 and earlier)
      that rely on vsyscall use will be terminated regardless of the filter
      program policy when executing time, gettimeofday, or getcpu.  This is
      only the case when vsyscall emulation is in use (vsyscall=emulate is the
      default).
      
      This patch emulates system call entry inside a vsyscall=emulate by
      populating regs->ax and regs->orig_ax with the system call number prior
      to calling into seccomp such that all seccomp-dependencies function
      normally.  Additionally, system call return behavior is emulated in line
      with other vsyscall entrypoints for the trace/trap cases.
      
      [ v2: fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to luto@mit.edu) ]
      Reported-and-tested-by: Owen Kibel <qmewlo@gmail.com>
      Signed-off-by: Will Drewry <wad@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging · ac7d181e
      Linus Torvalds authored
      Please pull one hwmon subsystem fix from Jean Delvare.
      
      * 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
        hwmon: (it87) Preserve configuration register bits on init
    • Merge tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 4264e6a2
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       - Fix an NFSv4 mount regression
       - Fix O_DIRECT list manipulation snafus
      
      * tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFSv4: Fix an NFSv4 mount regression
        NFS: Fix list manipulation snafus in fs/nfs/direct.c
    • Remove easily user-triggerable BUG from generic_setlease · 8d657eb3
      Dave Jones authored
      This can be trivially triggered from userspace by passing in something unexpected.
      
          kernel BUG at fs/locks.c:1468!
          invalid opcode: 0000 [#1] SMP
          RIP: 0010:generic_setlease+0xc2/0x100
          Call Trace:
            __vfs_setlease+0x35/0x40
            fcntl_setlease+0x76/0x150
            sys_fcntl+0x1c6/0x810
            system_call_fastpath+0x1a/0x1f
      Signed-off-by: Dave Jones <davej@redhat.com>
      Cc: stable@kernel.org # 3.2+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 39ea32ca
      Linus Torvalds authored
      Pull input layer fixes from Dmitry Torokhov:
       "The changes are limited to adding new VID/PID combinations to drivers
        to enable support for new versions of hardware, most notably hardware
        found in new MacBook Pro Retina boxes."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: xpad - add Andamiro Pump It Up pad
        Input: xpad - add signature for Razer Onza Tournament Edition
        Input: xpad - handle all variations of Mad Catz Beat Pad
        Input: bcm5974 - Add support for 2012 MacBook Pro Retina
        HID: add support for 2012 MacBook Pro Retina
    • Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 8488e408
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       - Some regression fixes at the audio part for devices with
         cx23885/cx25840
       - A DMA corruption fix at cx231xx
       - two fixes at the winbond IR driver
       - Several fixes for the EXYNOS media driver (s5p)
       - two fixes at the OMAP3 preview driver
       - one fix at the dvb core failure path
       - an include missing (slab.h) at smiapp-core causing compilation
         breakage
       - em28xx was not loading the IR driver anymore.
      
      * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (31 commits)
        [media] Revert "[media] V4L: JPEG class documentation corrections"
        [media] s5p-fimc: Add missing FIMC-LITE file operations locking
        [media] omap3isp: preview: Fix contrast and brightness handling
        [media] omap3isp: preview: Fix output size computation depending on input format
        [media] winbond-cir: Initialise timeout, driver_type and allowed_protos
        [media] winbond-cir: Fix txandrx module info
        [media] cx23885: Silence unknown command warnings
        [media] cx23885: add support for HVR-1255 analog (cx23888 variant)
        [media] cx23885: make analog support work for HVR_1250 (cx23885 variant)
        [media] cx25840: fix vsrc/hsrc usage on cx23888 designs
        [media] cx25840: fix regression in HVR-1800 analog audio
        [media] cx25840: fix regression in analog support hue/saturation controls
        [media] cx25840: fix regression in HVR-1800 analog support
        [media] s5p-mfc: Fixed setup of custom controls in decoder and encoder
        [media] cx231xx: don't DMA to random addresses
        [media] em28xx: fix em28xx-rc load
        [media] dvb-core: Release semaphore on error path dvb_register_device()
        [media] s5p-fimc: Stop media entity pipeline if fimc_pipeline_validate fails
        [media] s5p-fimc: Fix compiler warning in fimc-lite.c
        [media] s5p-fimc: media_entity_pipeline_start() may fail
        ...
    • Merge tag 'mmc-fixes-for-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc · 2c913900
      Linus Torvalds authored
      Pull MMC fixes from Chris Ball:
       - Revert a patch that made failing to select power class fatal;
         it turns out that it fails non-fatally on Tegra boards.
         Regression against 3.5-rc1.
       - Add the IRQF_ONESHOT flag to the cd-gpio driver, which turned
         into a regression in 3.5-rc1 when IRQF_ONESHOT became required
         for threaded IRQs with no handler.
      
      * tag 'mmc-fixes-for-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
        mmc: cd-gpio: pass IRQF_ONESHOT to request_threaded_irq()
        mmc: core: Revert "skip card initialization if power class selection fails"
    • Merge tag 'for-linus-20120712' of git://git.infradead.org/linux-mtd · 36ec9fbf
      Linus Torvalds authored
      Pull late MTD fixes from David Woodhouse:
       - fix 'sparse warning fix' regression which totally breaks MXC NAND
       - fix GPMI NAND regression when used with UBI
       - update/correct sysfs documentation for new 'bitflip_threshold' field
       - fix nandsim build failure
      
      * tag 'for-linus-20120712' of git://git.infradead.org/linux-mtd:
        mtd: nandsim: don't open code a do_div helper
        mtd: ABI documentation: clarification of bitflip_threshold
        mtd: gpmi-nand: fix read page when reading to vmalloced area
        mtd: mxc_nand: use 32bit copy functions
    • Merge tag 'mfd-for-linus-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 · 7801dc33
      Linus Torvalds authored
      Pull MFD Fixes from Samuel Ortiz:
       - Three Palmas fixes, One of them being a build error fix.
       - Two mc13xx fixes.  One for fixing an SPI regmap configuration and
         another one for working around an i.Mx hardware bug.
       - One omap-usb regression fix.
       - One twl6040 build breakage fix.
       - One file deletion (ab5500-core.h) that was overlooked during the last
         merge window.
      
      * tag 'mfd-for-linus-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
        mfd: Add missing hunk to change palmas irq to clear on read
        mfd: Fix palmas regulator pdata missing
        mfd: USB: Fix the omap-usb EHCI ULPI PHY reset fix issues.
        mfd: Update twl6040 Kconfig to avoid build breakage
        mfd: Delete ab5500-core.h
        mfd: mc13xxx workaround SPI hardware bug on i.Mx
        mfd: Fix mc13xxx SPI regmap
        mfd: Add terminating entry for i2c_device_id palmas table
    • Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh · 68394bfb
      Linus Torvalds authored
      Pull SuperH fixes from Paul Mundt.
      
      * tag 'sh-for-linus' of git://github.com/pmundt/linux-sh:
        SH: Convert out[bwl] macros to inline functions
        sh: Fix up se7721 GPIOLIB=y build warnings.
    • Merge git://git.kernel.org/pub/scm/virt/kvm/kvm · e7654c1e
      Linus Torvalds authored
      Pull a couple of KVM fixes from Avi Kivity:
       "One is an adjustment for an irq layer change that affected device
        assignment, the other a one-liner ppc fix."
      
      * git://git.kernel.org/pub/scm/virt/kvm/kvm:
        powerpc/kvm: Fix "PR" KVM implementation of H_CEDE
        KVM: Fix device assignment threaded irq handler
    • block: fix infinite loop in __getblk_slow · 91f68c89
      Jeff Moyer authored
      Commit 080399aa ("block: don't mark buffers beyond end of disk as
      mapped") exposed a bug in __getblk_slow that causes mount to hang as it
      loops infinitely waiting for a buffer that lies beyond the end of the
      disk to become uptodate.
      
      The problem was initially reported by Torsten Hilbrich here:
      
          https://lkml.org/lkml/2012/6/18/54
      
      and also reported independently here:
      
          http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511
      
      and then Richard W.M.  Jones and Marcos Mello noted a few separate
      bugzillas also associated with the same issue.  This patch has been
      confirmed to fix:
      
          https://bugzilla.redhat.com/show_bug.cgi?id=835019
      
      The main problem is here, in __getblk_slow:
      
              for (;;) {
                      struct buffer_head * bh;
                      int ret;
      
                      bh = __find_get_block(bdev, block, size);
                      if (bh)
                              return bh;
      
                      ret = grow_buffers(bdev, block, size);
                      if (ret < 0)
                              return NULL;
                      if (ret == 0)
                              free_more_memory();
              }
      
      __find_get_block does not find the block, since it will not be marked as
      mapped, and so grow_buffers is called to fill in the buffers for the
      associated page.  I believe the for (;;) loop is there primarily to
      retry in the case of memory pressure keeping grow_buffers from
      succeeding.  However, we also continue to loop for other cases, like the
      block lying beyond the end of the disk.  So, the fix I came up with is to
      only loop when grow_buffers fails due to memory allocation issues
      (return value of 0).
      
      The attached patch was tested by myself, Torsten, and Rich, and was
      found to resolve the problem in all cases.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-and-Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
      Tested-by: Richard W.M. Jones <rjones@redhat.com>
      Reviewed-by: Josh Boyer <jwboyer@redhat.com>
      Cc: Stable <stable@vger.kernel.org>  # 3.0+
      [ Jens is on vacation, taking this directly  - Linus ]
      --
      Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
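The retry rule described in this message ("only loop when grow_buffers fails due to memory allocation issues") can be modelled in user space. The sketch below is not the kernel code: `struct mock_dev` and the `mock_*` functions are hypothetical, and returning -ENXIO for a block beyond the end of the device is an assumption made for the illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_BLOCKS 16

/* Hypothetical device model: a few blocks plus simulated memory pressure. */
struct mock_dev {
    uint64_t end_block;        /* first block past the end of the device */
    int alloc_failures;        /* transient allocation failures still to come */
    bool mapped[MOCK_BLOCKS];  /* blocks that already have a usable buffer */
};

bool mock_find_block(const struct mock_dev *dev, uint64_t block)
{
    return block < MOCK_BLOCKS && dev->mapped[block];
}

/* Returns 1 on success, 0 on a transient memory failure, < 0 on hard errors. */
int mock_grow_buffers(struct mock_dev *dev, uint64_t block)
{
    if (dev->alloc_failures > 0) {
        dev->alloc_failures--;
        return 0;
    }
    if (block >= dev->end_block || block >= MOCK_BLOCKS)
        return -ENXIO; /* beyond the end of the device: retrying cannot help */
    dev->mapped[block] = true;
    return 1;
}

/* Returns 0 once the block is available, or a negative error code. */
int mock_getblk_slow(struct mock_dev *dev, uint64_t block)
{
    for (;;) {
        if (mock_find_block(dev, block))
            return 0;

        int ret = mock_grow_buffers(dev, block);
        if (ret < 0)
            return ret; /* hard failure: stop instead of looping forever */
        /* ret == 0: memory pressure; the kernel would free memory and retry. */
    }
}
```

A lookup for a block inside the device survives a few simulated allocation failures, while a lookup past `end_block` returns an error immediately rather than spinning.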