10 Dec, 2006 (40 commits)
    • Avi Kivity's avatar
      [PATCH] kvm: userspace interface · 6aa8b732
      Avi Kivity authored
      web site: http://kvm.sourceforge.net
      
      mailing list: kvm-devel@lists.sourceforge.net
        (http://lists.sourceforge.net/lists/listinfo/kvm-devel)
      
      The following patchset adds a driver for Intel's hardware virtualization
      extensions to the x86 architecture.  The driver adds a character device
      (/dev/kvm) that exposes the virtualization capabilities to userspace.  Using
      this driver, a process can run a virtual machine (a "guest") in a fully
      virtualized PC containing its own virtual hard disks, network adapters, and
      display.
      
      Using this driver, one can start multiple virtual machines on a host.
      
      Each virtual machine is a process on the host; a virtual cpu is a thread in
      that process.  kill(1), nice(1), top(1) work as expected.  In effect, the
      driver adds a third execution mode to the existing two: we now have kernel
      mode, user mode, and guest mode.  Guest mode has its own address space mapping
      guest physical memory (which is accessible to user mode by mmap()ing
      /dev/kvm).  Guest mode has no access to any I/O devices; any such access is
      intercepted and directed to user mode for emulation.
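
      A minimal userspace sketch of talking to that character device follows.  It
      assumes the KVM_GET_API_VERSION, KVM_CREATE_VM and KVM_CREATE_VCPU ioctls as
      exposed by <linux/kvm.h>; the exact interface of this early driver may have
      differed, so treat it as illustrative rather than as the driver's API:

          /* sketch.c -- open /dev/kvm and create an empty VM (illustrative only) */
          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <unistd.h>
          #include <linux/kvm.h>

          int main(void)
          {
              int kvm = open("/dev/kvm", O_RDWR);
              if (kvm < 0) {
                  perror("open /dev/kvm");
                  return 1;
              }
              printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

              int vm = ioctl(kvm, KVM_CREATE_VM, 0);    /* one VM per process */
              int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0); /* one vcpu per thread */
              printf("vm fd=%d vcpu fd=%d\n", vm, vcpu);

              close(vcpu);
              close(vm);
              close(kvm);
              return 0;
          }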
      
      The driver supports i386 and x86_64 hosts and guests.  All combinations are
      allowed except x86_64 guest on i386 host.  For i386 guests and hosts, both pae
      and non-pae paging modes are supported.
      
      SMP hosts and UP guests are supported.  At the moment only Intel
      hardware is supported, but AMD virtualization support is being worked on.
      
      Performance currently is non-stellar due to the naive implementation of the
      mmu virtualization, which throws away most of the shadow page table entries
      every context switch.  We plan to address this in two ways:
      
      - cache shadow page tables across tlb flushes
      - wait until AMD and Intel release processors with nested page tables
      
      Currently a virtual desktop is responsive but consumes a lot of CPU.  Under
      Windows I tried playing pinball and watching a few flash movies; with a recent
      CPU one can hardly feel the virtualization.  Linux/X is slower, probably due
      to X being in a separate process.
      
      In addition to the driver, you need a slightly modified qemu to provide I/O
      device emulation and the BIOS.
      
      Caveats (akpm: might no longer be true):
      
      - The Windows install currently bluescreens due to a problem with the
        virtual APIC.  We are working on a fix.  A temporary workaround is to
        use an existing image or install through qemu
      - Windows 64-bit does not work.  That's also true for qemu, so it's
        probably a problem with the device model.
      
      [bero@arklinux.org: build fix]
      [simon.kagstrom@bth.se: build fix, other fixes]
      [uril@qumranet.com: KVM: Expose interrupt bitmap]
      [akpm@osdl.org: i386 build fix]
      [mingo@elte.hu: i386 fixes]
      [rdreier@cisco.com: add log levels to all printks]
      [randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
      [anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
      Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
      Signed-off-by: Avi Kivity <avi@qumranet.com>
      Cc: Simon Kagstrom <simon.kagstrom@bth.se>
      Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
      Signed-off-by: Uri Lublin <uril@qumranet.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6aa8b732
    • Daniel Walker's avatar
      [PATCH] clocksource: small cleanup · f5f1a24a
      Daniel Walker authored
      Mostly changing alignment.  Just some general cleanup.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Daniel Walker <dwalker@mvista.com>
      Acked-by: John Stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f5f1a24a
    • Daniel Walker's avatar
      [PATCH] clocksource: add usage of CONFIG_SYSFS · 2b013700
      Daniel Walker authored
      Simply adds some ifdefs to remove the clocksource sysfs code when CONFIG_SYSFS
      isn't turned on.
      Signed-off-by: Daniel Walker <dwalker@mvista.com>
      Acked-by: John Stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2b013700
    • Arjan van de Ven's avatar
      [PATCH] user of the jiffies rounding patch: Slab · 2b284214
      Arjan van de Ven authored
      This patch introduces users of the round_jiffies() function in the slab code.
      
      The slab code has a few "run every second" timers for background work; these
      are obviously not timing critical as long as they happen roughly at the right
      frequency.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2b284214
    • Arjan van de Ven's avatar
      [PATCH] user of the jiffies rounding code: JBD · 44d306e1
      Arjan van de Ven authored
      This patch introduces a user of the round_jiffies() function: the "5 second"
      ext3/jbd wakeup.
      
      While "every 5 seconds" doesn't sound like a problem, there can be many of
      these timers (and they do add up across the kernel).  The "5 second" wakeup
      isn't really timing sensitive; in addition, even with rounding it'll still
      happen every 5 seconds (with the exception of the very first time, which is
      likely to be rounded up to somewhere closer to 6 seconds).
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      44d306e1
    • Arjan van de Ven's avatar
      [PATCH] round_jiffies infrastructure · 4c36a5de
      Arjan van de Ven authored
      Introduce a round_jiffies() function as well as a round_jiffies_relative()
      function.  These functions round a jiffies value to the next whole second.
      The primary purpose of this rounding is to cause all "we don't care exactly
      when" timers to happen at the same jiffy.
      
      This avoids multiple timers firing within the second for no real reason;
      with dynamic ticks these extra timers cause wakeups from deep CPU sleep
      states and thus waste power.
      
      The exact wakeup moment is skewed by the cpu number, to prevent all cpus
      from waking up at the exact same time (and hitting the same locks and
      cachelines there).
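
      As a hedged illustration (my_timer and my_work are hypothetical), a "we don't
      care exactly when" timer would use the new helpers roughly like this:

          /* fire about every 5 seconds, aligned to a whole second */
          mod_timer(&my_timer, round_jiffies(jiffies + 5 * HZ));

          /* same idea for delayed work, using the relative variant */
          schedule_delayed_work(&my_work, round_jiffies_relative(HZ));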
      
      [akpm@osdl.org: fix variable type]
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4c36a5de
    • Vadim Lobanov's avatar
      [PATCH] fdtable: Implement new pagesize-based fdtable allocator · 5466b456
      Vadim Lobanov authored
      This patch provides an improved fdtable allocation scheme, useful for
      expanding fdtable file descriptor entries.  The main focus is on the fdarray,
      as its memory usage grows 128 times faster than that of an fdset.
      
      The allocation algorithm sizes the fdarray in such a way that its memory usage
      increases in easy page-sized chunks. The overall algorithm expands the allowed
      size in powers of two, in order to amortize the cost of invoking vmalloc() for
      larger allocation sizes. Namely, the following sizes for the fdarray are
      considered, and the smallest that accommodates the requested fd count is
      chosen:
      
          pagesize / 4
          pagesize / 2
          pagesize      <- memory allocator switch point
          pagesize * 2
          pagesize * 4
          ...etc...
      
      Unlike the current implementation, this allocation scheme does not require a
      loop to compute the optimal fdarray size, and can be done in efficient
      straight-line code.
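
      An illustrative userspace model of that sizing rule (not the kernel's code; a
      loop is used here for clarity where the kernel can round up to a power of two
      directly):

          #define PAGE_SIZE 4096UL

          /* bytes to allocate for an fdarray holding nr file pointers */
          static unsigned long fdarray_size(unsigned int nr)
          {
              unsigned long bytes = (unsigned long)nr * sizeof(void *);
              unsigned long size = PAGE_SIZE / 4;   /* smallest candidate */

              while (size < bytes)                  /* /4, /2, 1, 2, 4, ... pages */
                  size <<= 1;
              return size;       /* sizes past the switch point use vmalloc() */
          }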
      
      Furthermore, since the fdarray overflows the pagesize boundary long before any
      of the fdsets do, it makes sense to optimize run-time by allocating both
      fdsets in a single swoop.  Even together, they will still be, by far, smaller
      than the fdarray.  The fdtable->open_fds is now used as the anchor for the
      fdset memory allocation.
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5466b456
    • Vadim Lobanov's avatar
      [PATCH] fdtable: Remove the free_files field · 4fd45812
      Vadim Lobanov authored
      An fdtable can either be embedded inside a files_struct or standalone (after
      being expanded).  When an fdtable is being discarded after all RCU references
      to it have expired, we must either free it directly, in the standalone case,
      or free the files_struct it is contained within, in the embedded case.
      
      Currently the free_files field controls this behavior, but we can get rid of
      it entirely, as all the necessary information is already recorded.  We can
      distinguish embedded and standalone fdtables using max_fds, and if it is
      embedded we can divine the relevant files_struct using container_of().
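
      A hedged sketch of the discard path described above (helper names like
      free_fdarr/free_fdset are illustrative, not necessarily the kernel's):

          static void free_fdtable_rcu(struct rcu_head *rcu)
          {
              struct fdtable *fdt = container_of(rcu, struct fdtable, rcu);

              if (fdt->max_fds <= NR_OPEN_DEFAULT) {
                  /* embedded: divine and free the enclosing files_struct */
                  struct files_struct *files =
                          container_of(fdt, struct files_struct, fdtab);
                  kmem_cache_free(files_cachep, files);
              } else {
                  /* standalone: free the fdtable and its arrays directly */
                  free_fdarr(fdt);
                  free_fdset(fdt);
                  kfree(fdt);
              }
          }
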
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4fd45812
    • Vadim Lobanov's avatar
      [PATCH] fdtable: Make fdarray and fdsets equal in size · bbea9f69
      Vadim Lobanov authored
      Currently, each fdtable supports three dynamically-sized arrays of data: the
      fdarray and two fdsets.  The code allows the number of fds supported by the
      fdarray (fdtable->max_fds) to differ from the number of fds supported by each
      of the fdsets (fdtable->max_fdset).
      
      In practice, it is wasteful for these two sizes to differ: whenever we hit a
      limit on the smaller-capacity structure, we will reallocate the entire fdtable
      and all the dynamic arrays within it, so any delta in the memory used by the
      larger-capacity structure will never be touched at all.
      
      Rather than hogging this excess, we shouldn't even allocate it in the first
      place, and keep the capacities of the fdarray and the fdsets equal.  This
      patch removes fdtable->max_fdset.  As an added bonus, most of the supporting
      code becomes simpler.
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bbea9f69
    • Vadim Lobanov's avatar
      [PATCH] fdtable: Delete pointless code in dup_fd() · f3d19c90
      Vadim Lobanov authored
      The dup_fd() function creates a new files_struct and fdtable embedded inside
      that files_struct, and then possibly expands the fdtable using expand_files().
      
      The out_release error path is invoked when expand_files() returns an error
      code.  However, when this attempt to expand fails, the fdtable is left in its
      original embedded form, so it is pointless to try to free the associated
      fdarray and fdsets.
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f3d19c90
    • Zach Brown's avatar
      [PATCH] dio: lock refcount operations · 5eb6c7a2
      Zach Brown authored
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up, I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racy.
      
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
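
      A hedged sketch of the completion-side pattern this creates (names follow the
      description above, not necessarily the final code):

          spin_lock_irqsave(&dio->bio_lock, flags);
          remaining = --dio->refcount;
          if (remaining == 1 && dio->waiter)
              wake_up_process(dio->waiter);  /* last bio in flight: wake submitter */
          spin_unlock_irqrestore(&dio->bio_lock, flags);

          if (remaining == 0)
              dio_complete_and_free(dio);    /* hypothetical: we held the last ref */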
      
      Once testing and waking are done under the lock, dio_await_one() can test its
      sleeping condition and mark itself uninterruptible under the lock.  It gets
      simpler and wait_for_more_bios() disappears.
      
      The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
      looks alarming.  This lock acquiry existed in that path before the recent
      dio completion patch set.  We shouldn't expect significant performance
      regression from returning to the behaviour that existed before the
      completion clean up work.
      
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5eb6c7a2
    • Zach Brown's avatar
      [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Zach Brown authored
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, its callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
      future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
      would be called for a second time which can manifest as an oops.
      
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, and for
      partial dio writes that must fall back to buffered and for aio ops that saw
      errors during submission.
      
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is returned.
       We maintain the previous behaviour of trying to write fs metadata for O_SYNC
      aio+dio writes.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8459d86a
    • Zach Brown's avatar
      [PATCH] dio: remove duplicate bio wait code · 20258b2b
      Zach Brown authored
      Now that we have a single refcount and waiting path we can reuse it in the
      async 'should_wait' path.  It continues to rely on the fragile link between
      the conditional in dio_complete_aio() which decides to complete the AIO and
      the conditional in direct_io_worker() which decides to wait and free.
      
      By waiting before dropping the reference we stop dio_bio_end_aio() from
      calling dio_complete_aio() which used to wake up the waiter after seeing the
      reference count drop to 0.  We hoist this wake up into dio_bio_end_aio() which
      now notices when it's left a single remaining reference that is held by the
      waiter.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      20258b2b
    • Zach Brown's avatar
      [PATCH] dio: formalize bio counters as a dio reference count · 0273201e
      Zach Brown authored
      Previously we had two confusing counts of bio progress.  'bio_count' was
      decremented as bios were processed and freed by the dio core.  It was used to
      indicate final completion of the dio operation.  'bios_in_flight' reflected
      how many bios were between submit_bio() and bio->end_io.  It was used by the
      sync path to decide when to wake up and finish completing bios and was ignored
      by the async path.
      
      This patch collapses the two notions into one notion of a dio reference count.
       bios hold a dio reference when they're between submit_bio and bio->end_io.
      
      Since bios_in_flight was only used in the sync path it is now equivalent to
      dio->refcount - 1 which accounts for direct_io_worker() holding a reference
      for the duration of the operation.
      
      dio_bio_complete() -> finished_one_bio() was called from the sync path after
      finding bios on the list that the bio->end_io function had deposited.
      finished_one_bio() can not drop the dio reference on behalf of these bios now
      because bio->end_io already has.  The is_async test in finished_one_bio()
      meant that it never actually did anything other than drop the bio_count for
      sync callers.  So we remove its refcount decrement, don't call it from
      dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
      caller after an explicit refcount decrement.  It is renamed dio_complete_aio()
      to reflect the remaining work it actually does.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0273201e
    • Zach Brown's avatar
      [PATCH] dio: call blk_run_address_space() once per op · 17a7b1d7
      Zach Brown authored
      We only need to call blk_run_address_space() once after all the bios for the
      direct IO op have been submitted.  This removes the chance of calling
      blk_run_address_space() after spurious wake ups as the sync path waits for
      bios to drain.  It's also one less difference between the sync and async paths.
      
      In the process we remove a redundant dio_bio_submit() that its caller had
      already performed.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      17a7b1d7
    • Zach Brown's avatar
      [PATCH] dio: centralize completion in dio_complete() · 6d544bb4
      Zach Brown authored
      There have been a lot of bugs recently due to the way direct_io_worker() tries
      to decide how to finish direct IO operations.  In the worst examples it has
      failed to call aio_complete() at all (hang) or called it too many times
      (oops).
      
      This set of patches cleans up the completion phase with the goal of removing
      the complexity that led to these bugs.  We end up with one path that
      calculates the result of the operation after all of the bios have completed.
      We decide when to generate a result of the operation using that path based on
      the final release of a refcount on the dio structure.
      
      I tried to progress towards the final state in steps that were relatively easy
      to understand.  Each step should compile but I only tested the final result of
      having all the patches applied.
      
      I've tested these on low end PC drives with aio-stress, the direct IO tests I
      could manage to get running in LTP, orasim, and some home-brew functional
      tests.
      
      In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3
      running DIO LTP tests.  They found that XFS bug which has since been addressed
      in the patch series.
      
      This patch:
      
      The mechanics which decide the result of a direct IO operation were duplicated
      in the sync and async paths.
      
      The async path didn't check page_errors which can manifest as silently
      returning success when the final pointer in an operation faults and its
      matching file region is filled with zeros.
      
      The sync path and async path differed in whether they passed errors to the
      caller's dio->end_io operation.  The async path was passing errors to it which
      trips an assertion in XFS, though it is apparently harmless.
      
      This centralizes the completion phase of dio ops in one place.  AIO will now
      return EFAULT consistently and all paths fall back to the previously sync
      behaviour of passing the number of bytes 'transferred' to the dio->end_io
      callback, regardless of errors.
      
      dio_await_completion() doesn't have to propagate EIO from non-uptodate bios
      now that it's being propagated through dio_complete() via dio->io_error.  This
      lets it return void which simplifies its sole caller.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6d544bb4
    • NeilBrown's avatar
      [PATCH] md: assorted md and raid1 one-liners · 17571284
      NeilBrown authored
      Fix a few bugs that meant that:
        - superblocks weren't always written at exactly the right time (this
          could show up if the array was not written to - writing to the array
          causes lots of superblock updates and so hides these errors).
      
        - restarting device recovery after a clean shutdown (version-1 metadata
          only) didn't work as intended (or at all).
      
      1/ Ensure superblock is updated when a new device is added.
      2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
         The body of this if takes one of two branches depending on whether
         MD_RECOVERY_SYNC is set, so testing it in the clause of the if
         is wrong.
      3/ Flag superblock for updating after a resync/recovery finishes.
      4/ If we find the need to restart a recovery in the middle (version-1
         metadata only) make sure a full recovery (not just as guided by
         bitmaps) does get done.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      17571284
    • NeilBrown's avatar
      [PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5 · c2b00852
      NeilBrown authored
      Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error
      to higher levels.  While this should be sufficient, it is safer to explicitly
      set the error code as well - less room for confusion.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c2b00852
    • NeilBrown's avatar
      [PATCH] md: remove some old ifdefed-out code from raid5.c · b8c6b645
      NeilBrown authored
      There are some vestiges of old code that was used for bypassing the stripe
      cache on reads in raid5.c.  This was never updated after the change from
      buffer_heads to bios, but was left as a reminder.
      
      That functionality has now been implemented in a completely different way, so
      the old code can go.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b8c6b645
    • Jeff Garzik's avatar
      [PATCH] MD: conditionalize some code · fdee8ae4
      Jeff Garzik authored
      The autorun code is only used if this module is built into the static
      kernel image.  Adjust #ifdefs accordingly.
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fdee8ae4
    • NeilBrown's avatar
      [PATCH] md: fix innocuous bug in raid6 stripe_to_pdidx · b875e531
      NeilBrown authored
      stripe_to_pdidx finds the index of the parity disk for a given stripe.  It
      assumes raid5 in that it uses "disks-1" to determine the number of data disks.
      
      This is incorrect for raid6 but fortunately the two usages cancel each other
      out.  The only way that 'data_disks' affects the calculation of pd_idx in
      raid5_compute_sector is when it is divided into the sector number.  But as
      that sector number is calculated by multiplying in the wrong value of
      'data_disks' the division produces the right value.
      
      So it is innocuous but needs to be fixed.
      
      Also change the calculation of raid_disks in compute_blocknr to make it
      more obviously correct (it seems at first to always use disks-1 too).
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b875e531
    • Raz Ben-Jehuda(caro)'s avatar
      [PATCH] md: enable bypassing cache for reads · 52488615
      Raz Ben-Jehuda(caro) authored
      Call the chunk_aligned_read where appropriate.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      52488615
    • Raz Ben-Jehuda(caro)'s avatar
      [PATCH] md: allow reads that have bypassed the cache to be retried on failure · 46031f9a
      Raz Ben-Jehuda(caro) authored
      If a bypass-the-cache read fails, we simply try again through the cache.  If
      it fails again it will trigger normal recovery procedures.
      
      update 1:
      
      From: NeilBrown <neilb@suse.de>
      
      1/
        chunk_aligned_read and retry_aligned_read assume that
            data_disks == raid_disks - 1
        which is not true for raid6.
        So when an aligned read request bypasses the cache, we can get the wrong data.
      
      2/ The cloned bio is being used-after-free in raid5_align_endio
         (to test BIO_UPTODATE).
      
      3/ We forgot to add rdev->data_offset when submitting
         a bio for aligned-read
      
      4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
         so we need to invalidate the segment counts.
      
      5/ We don't de-reference the rdev when the read completes.
         This means we need to record the rdev so it is still
         available in the end_io routine.  Fortunately
         bi_next in the original bio is unused at this point so
         we can stuff it in there.
      
      6/ We leak a cloned bio if the target rdev is not usable.
      
      From: NeilBrown <neilb@suse.de>
      
      update 2:
      
      1/ When aligned requests fail (read error) they need to be retried
         via the normal method (stripe cache).  As we cannot be sure that
         we can process a single read in one go (we may not be able to
         allocate all the stripes needed) we store a bio-being-retried
         and a list of bios-that-still-need-to-be-retried.
         When we find a bio that needs to be retried, we should add it to
         the list, not to the single bio...
      
      2/ We were never incrementing 'scnt' when resubmitting failed
         aligned requests.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      46031f9a
    • Raz Ben-Jehuda(caro)'s avatar
      [PATCH] md: define raid5_mergeable_bvec · 23032a0e
      Raz Ben-Jehuda(caro) authored
      This will encourage read requests to be on only one device, so we will often be
      able to bypass the cache for read requests.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      23032a0e
    • NeilBrown's avatar
      [PATCH] md: tidy up device-change notification when an md array is stopped · 0d4ca600
      NeilBrown authored
      An md array can be stopped leaving all the settings still in place, or it can
      be torn down and destroyed.  set_capacity and other change notifications only
      happen in the latter case, but should happen in both.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0d4ca600
    • Paul Mackerras's avatar
      [PATCH] Fbdev driver for IBM GXT4500P videocards · a3d89983
      Paul Mackerras authored
      This is an fbdev driver for the IBM GXT4500P display card found in some IBM
      System P (pSeries) machines.  These cards have hardware 2D and 3D
      capabilities, but the driver does not use them; it just exports a dumb
      framebuffer.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Acked-by: James Simmons <jsimmons@infradead.org>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a3d89983
    • Alan Cox's avatar
      [PATCH] ide-cd: Handle strange interrupt on the Intel ESB2 · ee2f344b
      Alan Cox authored
      The ESB2 appears to emit spurious DMA interrupts when configured for native
      mode and handling ATAPI devices.  Stratus were able to pin this bug down and
      produce a patch.  This is a rework which applies the fixup only to the ESB2
      (for now).  We can apply it to other chips later if the same problem is found.
      
      This code has been tested and confirmed to fix the problem on the tested
      systems.
      Signed-off-by: Alan Cox <alan@redhat.com>
      (Most of the hard work done by Stratus however)
      Cc: Jens Axboe <axboe@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ee2f344b
    • Miguel Ojeda Sandonis's avatar
      [PATCH] kernel/sched.c: whitespace cleanups · 33859f7f
      Miguel Ojeda Sandonis authored
      [akpm@osdl.org: additional cleanups]
      Signed-off-by: Miguel Ojeda Sandonis <maxextreme@gmail.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      33859f7f
    • Chen, Kenneth W's avatar
      [PATCH] sched: optimize activate_task for RT task · 62ab616d
      Chen, Kenneth W authored
      RT tasks do not participate in interactivity priority and thus shouldn't
      be bothered with timestamp and p->sleep_type manipulation when the task is
      being put on the run queue.  Bypass all of them with a single if (rt_task)
      test.
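
      Roughly, as a sketch of the shape of the change rather than the exact diff:

          static void activate_task(struct task_struct *p, struct rq *rq, int local)
          {
              if (rt_task(p))
                  goto out;    /* skip interactivity/timestamp bookkeeping */

              /* ... recalc_task_prio(), sleep_type and timestamp handling ... */
          out:
              __activate_task(p, rq);
          }
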
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      62ab616d
    • Chen, Kenneth W's avatar
      [PATCH] sched: remove lb_stopbalance counter · 06066714
      Chen, Kenneth W authored
      Remove the scheduler stats lb_stopbalance counter.  This counter can be
      calculated as lb_balanced - lb_nobusyg - lb_nobusyq.  There is no need to
      create a gazillion counters when we can derive the value.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      06066714
    • Siddha, Suresh B's avatar
      [PATCH] sched: decrease number of load balances · 783609c6
      Siddha, Suresh B authored
      Currently, at a particular domain, each cpu in the sched group will do a
      load balance at the frequency of balance_interval.  The more cores and
      threads, the more cpus will be in each sched group at the SMP and NUMA
      domains, and we end up spending quite a bit of time doing load balancing
      in those domains.

      Fix this by making only one cpu (the first idle cpu, or the first cpu in the
      group if all the cpus are busy) in the sched group do the load balance at
      that particular sched domain; this load will slowly percolate down to the
      other cpus within that group (when they do load balancing at lower domains).
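
      A hedged sketch of the check this adds to the balancing path (the
      first_idle_cpu() helper is illustrative):

          /* only one cpu in our sched group scans this domain */
          balance_cpu = first_idle_cpu(group);
          if (balance_cpu == -1)
              balance_cpu = first_cpu(group->cpumask);

          if (balance_cpu != this_cpu)
              return;    /* some other cpu will balance at this domain */
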
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      783609c6
    • Mike Galbraith's avatar
      [PATCH] sched: improve migration accuracy · b18ec803
      Mike Galbraith authored
      Co-opt rq->timestamp_last_tick to maintain a cache_hot_time evaluation
      reference timestamp at both tick and sched times to prevent said reference,
      formerly rq->timestamp_last_tick, from being behind task->last_ran at
      evaluation time, and to move said reference closer to current time on the
      remote processor, intent being to improve cache hot evaluation and
      timestamp adjustment accuracy for task migration.
      
      Fix minor sched_time double accounting error which occurs when a task
      passing through schedule() does not schedule off, and takes the next timer
      tick.
      
      [kenneth.w.chen@intel.com: cleanup]
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: Don Mullis <dwm@meer.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b18ec803
    • Christoph Lameter's avatar
      [PATCH] sched: add option to serialize load balancing · 08c183f3
      Christoph Lameter authored
      Large sched domains can be very expensive to scan.  Add an option SD_SERIALIZE
      to the sched domain flags.  If that flag is set then we make sure that no
      other such domain is being balanced.
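
      A hedged sketch of how the flag could be honoured in the rebalance loop (the
      global lock here is illustrative):

          static DEFINE_SPINLOCK(balancing);

          for_each_domain(this_cpu, sd) {
              if ((sd->flags & SD_SERIALIZE) && !spin_trylock(&balancing))
                  continue;    /* another cpu is balancing a serialized domain */

              /* ... load_balance() for this domain ... */

              if (sd->flags & SD_SERIALIZE)
                  spin_unlock(&balancing);
          }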
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      08c183f3
    • Christoph Lameter's avatar
      [PATCH] sched: call tasklet less frequently · 1bd77f2d
      Christoph Lameter authored
      Trigger softirq less frequently
      
      Before this patch we trigger the softirq at an offset of sd->interval.
      However, if the queue is busy then it is sufficient to schedule the softirq
      with sd->interval * busy_factor.
      
      So we modify the calculation of the next time to balance by taking
      the interval added to last_balance again. This is only the
      right value if the idle/busy situation continues as is.
      
      There are two potential trouble spots:
      - If the queue was idle and now gets busy then we call rebalance
        early. However, that is not a problem because we will then use
        the longer interval for the next period.
      
      - If the queue was busy and becomes idle then we potentially
        wait too long before rebalancing. However, when the task
        goes idle then idle_balance is called. We add another calculation
        of the next balance time based on sd->interval in idle_balance
        so that we will rebalance soon.
      
      V2->V3:
      - Calculate rebalance time based on current jiffies and not
        based on the jiffies at the last time we load balanced.
        We no longer rely on staggering and therefore we can
        afford to do this now.
      
      V3->V4:
      - Use functions to do jiffy comparisons.
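
      Put together, the per-domain decision sketched above looks roughly like this
      (names follow the description, not necessarily the final code):

          unsigned long interval = sd->balance_interval;

          if (!idle)
              interval *= sd->busy_factor;   /* busy queues rebalance less often */
          interval = msecs_to_jiffies(interval);

          if (time_after_eq(jiffies, sd->last_balance + interval)) {
              /* ... load_balance() ... */
              sd->last_balance = jiffies;    /* V3: based on current jiffies */
          }
          next_balance = min(next_balance, sd->last_balance + interval);
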
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1bd77f2d
    • Christoph Lameter's avatar
      [PATCH] sched: use softirq for load balancing · c9819f45
      Christoph Lameter authored
      Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
      softirq.
      
      We calculate the earliest time for each layer of sched domains to be rescanned
      (this is the rescan time for idle) and use the earliest of those to schedule
      the softirq via a new field "next_balance" added to struct rq.
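
      A hedged sketch of the hookup (the open_softirq() of this era took a third
      data argument; SCHED_SOFTIRQ and rq->next_balance are introduced by this
      patch):

          /* registration, at scheduler init time */
          open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);

          /* from scheduler_tick(), once the earliest rescan time has passed */
          if (time_after_eq(jiffies, rq->next_balance))
              raise_softirq(SCHED_SOFTIRQ);
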
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c9819f45
    • Christoph Lameter's avatar
      [PATCH] sched: move idle status calculation into rebalance_tick() · e418e1c2
      Christoph Lameter authored
      Perform the idle state determination in rebalance_tick.
      
      If we separate balancing from sched_tick then we also need to determine the
      idle state in rebalance_tick.
      
      V2->V3
      	Remove useless idle != 0 check. Checking nr_running seems
      	to be sufficient. Thanks Suresh.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e418e1c2
    • Christoph Lameter's avatar
      [PATCH] sched: extract load calculation from rebalance_tick · 7835b98b
      Christoph Lameter authored
      A load calculation is always done in rebalance_tick() in addition to the real
      load balancing activities that only take place when certain jiffie counts have
      been reached.  Move that processing into a separate function and call it
      directly from scheduler_tick().
      
      Also extract the time slice handling from scheduler_tick and put it into a
      separate function.  Then we can clean up scheduler_tick significantly.  It
      will no longer have any gotos.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      7835b98b
    • Christoph Lameter's avatar
      [PATCH] sched: disable interrupts for locking in load_balance() · fe2eea3f
      Christoph Lameter authored
      Interrupts must be disabled for request queue locks if we want to run
      load_balance() with interrupts enabled.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fe2eea3f
    • Christoph Lameter's avatar
      [PATCH] sched: remove staggering of load balancing · 4211a9a2
      Christoph Lameter authored
      Timer interrupts already are staggered.  We do not need an additional layer of
      time staggering for short load balancing actions that take a reasonably small
      portion of the time slice.
      
      For load balancing on large sched_domains we will add a serialization later
      that avoids concurrent load balance operations and thus has the same effect as
      load staggering.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4211a9a2