1. 05 Jul, 2003 17 commits
    • [PATCH] block request batching · 930805a2
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      The following patch gets batching working how it should be.
      
      After a process is woken up, it is allowed to allocate up to 32 requests
      for 20ms.  It does not stop other processes from submitting requests while
      it isn't submitting, though.  This should allow fewer context switches, and allow
      batches of requests from each process to be sent to the io scheduler
      instead of 1 request from each process.
      
      tiobench sequential writes are more than tripled, random writes are nearly
      doubled over mm1.  In earlier tests I generally saw better CPU efficiency
      but it doesn't show here.  There is still debug code to be taken out.  It's
      also only on UP.
      
                                      Avg     Maximum     Lat%   Lat%   CPU
       Identifier    Rate  (CPU%)  Latency   Latency     >2s    >10s   Eff

       Sequential Reads
       ------------------- ------ --------- ---------- ------- ------ ----
       -2.5.71-mm1   11.13 3.783%    46.10    24668.01   0.84   0.02   294
       +2.5.71-mm1   13.21 4.489%    37.37     5691.66   0.76   0.00   294
      
       Random Reads
       ------------------- ------ --------- ---------- ------- ------ ----
       -2.5.71-mm1    0.97 0.582%   519.86     6444.66  11.93   0.00   167
       +2.5.71-mm1    1.01 0.604%   484.59     6604.93  10.73   0.00   167
      
       Sequential Writes
       ------------------- ------ --------- ---------- ------- ------ ----
       -2.5.71-mm1    4.85 4.456%    77.80    99359.39   0.18   0.13   109
       +2.5.71-mm1   14.11 14.19%    10.07    22805.47   0.09   0.04    99
      
       Random Writes
       ------------------- ------ --------- ---------- ------- ------ ----
       -2.5.71-mm1    0.46 0.371%    14.48     6173.90   0.23   0.00   125
       +2.5.71-mm1    0.86 0.744%    24.08     8753.66   0.31   0.00   115
      
      It decreases the context switch rate on IBM's 8-way running ext2 tiobench with
      64 threads from ~2500/s to ~140/s in their regression tests.
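
      Roughly, the batching window can be sketched like this (illustrative only;
      the identifiers and constants below are assumptions for the sketch, not
      necessarily what the patch uses):

      #include <linux/sched.h>
      #include <linux/jiffies.h>

      #define BATCH_REQUESTS	32		/* requests a newly-woken task may take */
      #define BATCH_TIME	(HZ / 50)	/* ~20ms batching window */

      struct io_context {			/* per-process; see the io-context patch below */
      	unsigned long last_waited;	/* jiffies when the task was last woken */
      	int nr_batch_requests;		/* allocations left in this batch */
      };

      /* Is this task still inside its batching window? */
      static int ioc_batching(struct io_context *ioc)
      {
      	return ioc->nr_batch_requests > 0 &&
      		time_before(jiffies, ioc->last_waited + BATCH_TIME);
      }

      /* Called when a task is woken after sleeping on a full request queue. */
      static void ioc_set_batching(struct io_context *ioc)
      {
      	ioc->nr_batch_requests = BATCH_REQUESTS;
      	ioc->last_waited = jiffies;
      }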
    • [PATCH] generic io contexts · 16f88dbd
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      Generalise the AS-specific per-process IO context so that other IO schedulers
      could use it.
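
      A minimal sketch of the generalised structure and its lifecycle (field and
      helper names are illustrative, and a struct io_context *io_context pointer
      in task_struct is assumed; this is not the literal patch):

      #include <linux/sched.h>
      #include <linux/slab.h>

      struct io_context {
      	atomic_t refcount;		/* shared by the task and the IO scheduler */
      	unsigned long last_waited;	/* generic batching state */
      	int nr_batch_requests;
      	void *elevator_data;		/* scheduler-private state (e.g. AS) */
      };

      /* Lazily attach a context to the current task. */
      struct io_context *get_io_context(void)
      {
      	struct io_context *ioc = current->io_context;

      	if (!ioc) {
      		ioc = kmalloc(sizeof(*ioc), GFP_ATOMIC);
      		if (ioc) {
      			atomic_set(&ioc->refcount, 1);
      			ioc->elevator_data = NULL;
      			current->io_context = ioc;
      		}
      	}
      	return ioc;
      }

      /* Dropped by both the exiting task and any IO scheduler holding a ref. */
      void put_io_context(struct io_context *ioc)
      {
      	if (ioc && atomic_dec_and_test(&ioc->refcount))
      		kfree(ioc);
      }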
    • [PATCH] block batching fairness · 80af89ca
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This patch fixes the request batching fairness/starvation issue.  It's not
      clear what is going on with 2.4, but it seems to be a problem in this
      area.
      
      Anyway, previously:
      
      	* request queue fills up
      	* process 1 calls get_request, sleeps
      	* a couple of requests are freed
      	* process 2 calls get_request, proceeds
      	* a couple of requests are freed
      	* process 2 calls get_request...
      
      Now, as unlikely as it seems, it could be a problem.  It's a fairness problem
      that process 2 can skip ahead of process 1 anyway.
      
      With the patch:
      
      	* request queue fills up
      	* any process calling get_request will sleep
      	* once the queue gets below the batch watermark, processes
      	  start being woken, and may allocate.
      
      
      This patch includes Chris Mason's fix to only clear queue_full when all tasks
      have been woken.  Previously I think starvation and unfairness could still
      occur.
      
      With this change to the blk-fair-batches patch, Chris is showing some much
      improved numbers for 2.4 - 170 ms max wait vs 2700ms without blk-fair-batches
      for a dbench 90 run.  He didn't indicate how much difference his patch alone
      made, but it is an important fix I think.
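
      In outline, the intended behaviour looks something like the sketch below
      (approximate; the queue-full helpers and request-list field names are
      assumptions based on the description, and get_request() is the non-blocking
      allocator, not shown):

      #include <linux/blkdev.h>
      #include <linux/sched.h>
      #include <linux/wait.h>

      /* Allocation side: once the queue is full, everyone sleeps FIFO. */
      static struct request *get_request_wait(request_queue_t *q, int rw)
      {
      	DEFINE_WAIT(wait);
      	struct request *rq;

      	do {
      		prepare_to_wait_exclusive(&q->rq.wait[rw], &wait,
      					  TASK_UNINTERRUPTIBLE);
      		rq = get_request(q, rw, GFP_NOIO);
      		if (!rq) {
      			blk_set_queue_full(q, rw);	/* later arrivals must sleep too */
      			io_schedule();
      		}
      		finish_wait(&q->rq.wait[rw], &wait);
      	} while (!rq);

      	return rq;
      }

      /* Freeing side: wake sleepers in order; only clear "full" once none remain. */
      static void freed_request(request_queue_t *q, int rw)
      {
      	if (q->rq.count[rw] + 1 <= q->nr_requests) {
      		if (waitqueue_active(&q->rq.wait[rw]))
      			wake_up(&q->rq.wait[rw]);
      		else
      			blk_clear_queue_full(q, rw);
      	}
      }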
    • [PATCH] handle OOM in get_request_wait() · f67198fb
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      If there are no requests in flight against the target device and
      get_request() fails, nothing will wake us up.  Fix.
    • [PATCH] allow the IO scheduler to pass an allocation hint to · 08f36413
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      
      This patch implements a hint so that AS can tell the request allocator to
      allocate a request even if there are none left (the accounting is quite
      flexible and easily handles overallocations).
      
      elv_may_queue semantics have changed from "the elevator does _not_ want
      another request allocated" to "the elevator _insists_ that another request is
      allocated".  I couldn't see any harm ;)
      
      Now in practice, AS will only allow _1_ request over the limit, because as
      soon as the request is sent to AS, it stops anticipating.
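
      Something like the following is the shape of it (illustrative; helper names
      and the accounting details are assumptions, not the literal patch):

      #include <linux/blkdev.h>

      /* Non-blocking allocator: normally fails over the limit, unless the
       * elevator insists (e.g. AS wanting one more request for the process
       * it is currently anticipating). */
      static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
      {
      	struct request *rq;
      	int force = elv_may_queue(q, rw);

      	if (q->rq.count[rw] >= q->nr_requests && !force)
      		return NULL;

      	rq = blk_alloc_request(q, gfp_mask);	/* accounting tolerates overshoot */
      	if (rq)
      		q->rq.count[rw]++;
      	return rq;
      }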
    • [PATCH] blk_congestion_wait threshold cleanup · 4e83dc01
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      Now that we are counting requests (not requests free), this patch changes
      the congested & batch watermarks to be more logical.  Also a minor fix to
      the sysfs code.
    • [PATCH] per queue nr_requests · ee66147b
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This gets rid of the global queue_nr_requests and usage of BLKDEV_MAX_RQ
      (the latter is now only used to set the queues' defaults).
      
      The queue depth becomes per-queue, controlled by a sysfs entry.
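
      The plumbing is roughly as below (an approximation; the store helper and the
      minimum clamp are illustrative).  At runtime the depth can then be tuned per
      device, e.g. "echo 256 > /sys/block/hda/queue/nr_requests".

      #include <linux/blkdev.h>
      #include <linux/kernel.h>

      static ssize_t queue_requests_store(struct request_queue *q,
      				    const char *page, size_t count)
      {
      	unsigned long nr = simple_strtoul(page, NULL, 10);

      	if (nr < BLKDEV_MIN_RQ)		/* keep a sane lower bound */
      		nr = BLKDEV_MIN_RQ;
      	q->nr_requests = nr;		/* per-queue; BLKDEV_MAX_RQ is only the default */
      	return count;
      }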
    • [PATCH] Use kblockd for running request queues · 179b68bb
      Andrew Morton authored
      Using keventd for running request_fns is risky because keventd itself can
      block on disk I/O.  Use the new kblockd kernel threads for the generic
      unplugging.
    • [PATCH] anticipatory I/O scheduler · 97ff29c2
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This is the core anticipatory IO scheduler.  There are nearly 100 changesets
      in this and five months' work.  I really cannot describe it fully here.
      
      Major points:
      
      - It works by recognising that reads are dependent: we don't know where the
        next read will occur, but it's probably close by the previous one.  So once
        a read has completed we leave the disk idle, anticipating that a request
        for a nearby read will come in.
      
      - There is read batching and write batching logic.
      
        - when we're servicing a batch of writes we will refuse to seek away
          for a read for some tens of milliseconds.  Then the write stream is
          preempted.
      
        - when we're servicing a batch of reads (via anticipation) we'll do
          that for some tens of milliseconds, then preempt.
      
      - There are request deadlines, for latency and fairness.
        The oldest outstanding request is examined at regular intervals. If
        this request is older than a specific deadline, it will be the next
        one dispatched. This gives a good fairness heuristic while being simple
        because processes tend to have localised IO.
      
      
      Just about all of the rest of the complexity involves an array of fixups
      which prevent most of the obvious failure modes with anticipation: trying
      not to leave the disk head pointlessly idle.  Some of these algorithms are:
      
      - Process tracking.  If the process whose read we are anticipating submits
        a write, abandon anticipation.
      
      - Process exit tracking.  If the process whose read we are anticipating
        exits, abandon anticipation.
      
      - Process IO history.  We accumulate statistical info on the process's
        recent IO patterns to aid in making decisions about how long to anticipate
        new reads.
      
        Currently thinktime and seek distance are tracked. Thinktime is the
        time between when a process's last request has completed and when it
        submits another one. Seek distance is simply the number of sectors
        between each read request. If either statistic becomes too high,
        it isn't anticipated that the process will submit another read.
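
        As a rough sketch of the decision (the statistics are kept as decaying
        means in the real code; all names and thresholds here are illustrative):

        #include <linux/types.h>

        struct as_io_context {
        	unsigned long mean_thinktime;		/* ms: completion -> next read */
        	sector_t mean_seek_distance;		/* sectors between reads */
        };

        static int as_can_anticipate(struct as_io_context *aic,
        			     unsigned long max_thinktime_ms,
        			     sector_t max_seek)
        {
        	if (aic->mean_thinktime > max_thinktime_ms)
        		return 0;	/* process thinks too long: idling would be wasted */
        	if (aic->mean_seek_distance > max_seek)
        		return 0;	/* its reads are far apart: a seek is coming anyway */
        	return 1;		/* worth keeping the head where it is */
        }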
      
      The above all means that we need a per-process "io context".  This is a fully
      refcounted structure.  In this patch it is AS-only.  Later we generalise it a
      little so other IO schedulers could use the same framework.
      
      - Requests are grouped as synchronous and asynchronous, whereas the deadline
        scheduler groups requests as reads and writes. This can provide better
        sync write performance, and may give better responsiveness with journalling
        filesystems (although we haven't done that yet).
      
        We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
        current->flags.  The plan is to remove this later, and to propagate the
        sync hint from writeback_control.sync_mode into bio->bi_flags thence into
        request->flags.  Once that is done, direct-io needs to set the BIO sync
        hint as well.
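
        Concretely, classification amounts to something like this for now (a
        sketch; the helper name is made up):

        #include <linux/blkdev.h>
        #include <linux/sched.h>

        static inline int as_request_is_sync(struct request *rq)
        {
        	if (rq_data_dir(rq) == READ)
        		return 1;			/* reads are always synchronous */
        	return current->flags & PF_SYNCWRITE;	/* interim hack: fsync/O_SYNC writers */
        }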
      
      - There is also quite a bit of complexity that has gone into bashing TCQ into
        submission. Timing for a read batch is not started until the first read
        request actually completes. A read batch also does not start until all
        outstanding writes have completed.
      
      AS is the default IO scheduler.  deadline may be chosen by booting with
      "elevator=deadline".
      
      There are a few reasons for retaining deadline:
      
      - AS is often slower than deadline in random IO loads with large TCQ
        windows. The usual real world task here is OLTP database loads.
      
      - deadline is presumably more stable.
      
      - deadline is much simpler.
      
      
      
      The tunable per-queue entries under /sys/block/*/iosched/ are all in
      milliseconds:
      
      * read_expire
      
        Controls how long until a request becomes "expired".
      
        It also controls the interval at which expired requests are served, so if
        set to 50, a request might take anywhere up to 100ms to be serviced _if_ it
        is the next one on the expired list.
      
        Obviously it can't make the disk go faster.  The result is basically the
        timeslice a reader gets in the presence of other IO.  100 / ((seek time /
        read_expire) + 1) is very roughly the % streaming read efficiency your disk
        should get in the presence of multiple readers.
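
        For example, assuming a (hypothetical) 10ms average seek and read_expire
        set to 100, that gives 100 / ((10/100) + 1), i.e. roughly 91%: each reader
        streams for about 100ms and then pays about one seek before the disk moves
        on to other IO.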
      
      * read_batch_expire
      
        Controls how much time a batch of reads is given before pending writes
        are served.  Higher value is more efficient.  Shouldn't really be below
        read_expire.
      
      * write_ versions of the above
      
      * antic_expire
      
        Controls the maximum amount of time we can anticipate a good read before
        giving up.  Many other factors may cause anticipation to be stopped early,
        or some processes will not be "anticipated" at all.  Should be a bit higher
        for big seek time devices, though not in linear correspondence - most
        processes have only a few ms thinktime.
    • [PATCH] elevator completion API · 104e6fdc
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      Introduces an elevator_completed_req() callback with which the generic
      queueing layer may tell an IO scheduler that a particular request has
      finished.
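
      In outline (the method-pointer and wrapper names follow the existing
      elevator_ops convention, but treat them as approximations rather than the
      literal patch):

      /* Fragment: new optional method in the elevator operations table. */
      struct elevator_s {
      	/* ... existing methods ... */
      	void (*elevator_completed_req_fn)(request_queue_t *q, struct request *rq);
      };

      /* Called from the generic completion path. */
      void elv_completed_request(request_queue_t *q, struct request *rq)
      {
      	elevator_t *e = &q->elevator;

      	if (e->elevator_completed_req_fn)
      		e->elevator_completed_req_fn(q, rq);
      }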
    • [PATCH] elv_may_queue() API function · 7d2483a9
      Andrew Morton authored
      Introduces the elv_may_queue() predicate with which the IO scheduler may tell
      the generic request layer that we may add another request to this queue.
      
      It is used by the CFQ elevator.
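
      The wrapper is essentially a one-line dispatch (sketch; exact prototypes may
      differ):

      /* Ask the IO scheduler whether another request may be queued right now. */
      int elv_may_queue(request_queue_t *q, int rw)
      {
      	elevator_t *e = &q->elevator;

      	if (e->elevator_may_queue_fn)
      		return e->elevator_may_queue_fn(q, rw);
      	return 0;	/* no hint: fall back to the normal limits */
      }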
    • [PATCH] Create `kblockd' workqueue · 33c66485
      Andrew Morton authored
      keventd is inappropriate for running block request queues because keventd
      itself can get blocked on disk I/O - via call_usermodehelper()'s vfork and,
      presumably, GFP_KERNEL allocations.

      So create a new gang of kernel threads whose mandate is running low-level
      disk operations.  It must never block on disk IO, so any memory allocations
      should be GFP_NOIO.
      
      We mainly use it for running unplug operations from interrupt context.
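
      Setup and use are roughly as follows (a sketch using the standard workqueue
      API; the scheduling helper name is illustrative):

      #include <linux/workqueue.h>
      #include <linux/init.h>
      #include <linux/kernel.h>

      static struct workqueue_struct *kblockd_workqueue;

      static int __init kblockd_init(void)
      {
      	kblockd_workqueue = create_workqueue("kblockd");
      	if (!kblockd_workqueue)
      		panic("Failed to create kblockd workqueue\n");
      	return 0;
      }

      /* e.g. deferring a queue unplug from interrupt context; any memory the
       * work function needs must be allocated GFP_NOIO/GFP_ATOMIC. */
      int kblockd_schedule_work(struct work_struct *work)
      {
      	return queue_work(kblockd_workqueue, work);
      }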
    • [PATCH] bring back the batch_requests function · 3abbd8ff
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      The batch_requests function got lost during the merge of the dynamic request
      allocation patch.
      
      We need it for the anticipatory scheduler - when the number of threads
      exceeds the number of requests, the anticipated-upon task will undesirably
      sleep in get_request_wait().
      
      And apparently some block devices which use small requests need it so they
      can string a decent number together.
      
      Jens has acked this patch.
    • [PATCH] ipc semaphore optimization · 3faa61fe
      Andrew Morton authored
      From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      
      This patch proposes a performance fix for the current IPC semaphore
      implementation.
      
      There are two shortcomings in the current implementation:
      try_atomic_semop() is called twice to wake up a blocked process,
      once from update_queue() (executed by the process that wakes up
      the sleeping process) and once in the retry part of the blocked process
      (executed by the blocked process that gets woken up).
      
      A second issue is that when several sleeping processes are eligible
      for wake-up, they are woken in a daisy-chain formation, each one in turn
      waking up the next process in line.  However, every time a process
      wakes up, it starts scanning the wait queue from the beginning, not from
      where it last left off.  This causes a large amount of unnecessary
      scanning of the wait queue when the queue is deep.  Blocked processes
      come and go, but chances are there are still quite a few blocked
      processes sitting at the beginning of that queue.
      
      What we are proposing here is to merge the portion of the code in the
      bottom part of sys_semtimedop() (code that gets executed when a sleeping
      process gets woken up) into the update_queue() function.  The benefit is
      twofold: (1) it reduces redundant calls to try_atomic_semop() and (2) it
      makes finding eligible processes to wake up more efficient, giving higher
      concurrency for multiple wake-ups.
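
      In sketch form, the merged wake-up path looks something like this
      (simplified; locking, the undo machinery and error handling are omitted, and
      the queue helpers are only approximations of ipc/sem.c):

      #include <linux/sem.h>

      static void update_queue(struct sem_array *sma)
      {
      	struct sem_queue *q, *next;

      	/* One pass by the waker, instead of each sleeper re-checking itself
      	 * and then waking the next sleeper in a daisy chain. */
      	for (q = sma->sem_pending; q; q = next) {
      		int error;

      		next = q->next;			/* q may be removed below */
      		error = try_atomic_semop(sma, q->sops, q->nsops,
      					 q->undo, q->pid);
      		if (error > 0)
      			continue;		/* still blocked: leave it queued */

      		q->status = error;		/* 0 = done, <0 = failure */
      		remove_from_queue(sma, q);
      		wake_up_process(q->sleeper);	/* sleeper just returns q->status */
      	}
      }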
      
      We have measured that this patch significantly improves throughput for a
      large application on an industry-standard benchmark.
      
      This patch is relative to 2.5.72.  Any feedback is very much
      appreciated.
      
      Some kernel profile data attached:
      
        Kernel profile before optimization:
        -----------------------------------------------
                      0.05    0.14   40805/529060      sys_semop [133]
                      0.55    1.73  488255/529060      ia64_ret_from_syscall [2]
      [52]     2.5    0.59    1.88  529060         sys_semtimedop [52]
                      0.05    0.83  477766/817966      schedule_timeout [62]
                      0.34    0.46  529064/989340      update_queue [61]
                      0.14    0.00 1006740/6473086     try_atomic_semop [75]
                      0.06    0.00  529060/989336      ipcperms [149]
        -----------------------------------------------
      
                      0.30    0.40  460276/989340      semctl_main [68]
                      0.34    0.46  529064/989340      sys_semtimedop [52]
      [61]     1.5    0.64    0.87  989340         update_queue [61]
                      0.75    0.00 5466346/6473086     try_atomic_semop [75]
                      0.01    0.11  477676/576698      wake_up_process [146]
        -----------------------------------------------
                      0.14    0.00 1006740/6473086     sys_semtimedop [52]
                      0.75    0.00 5466346/6473086     update_queue [61]
      [75]     0.9    0.89    0.00 6473086         try_atomic_semop [75]
        -----------------------------------------------
      
        Kernel profile with optimization:
      
        -----------------------------------------------
                      0.03    0.05   26139/503178      sys_semop [155]
                      0.46    0.92  477039/503178      ia64_ret_from_syscall [2]
      [61]     1.2    0.48    0.97  503178         sys_semtimedop [61]
                      0.04    0.79  470724/784394      schedule_timeout [62]
                      0.05    0.00  503178/3301773     try_atomic_semop [109]
                      0.05    0.00  503178/930934      ipcperms [149]
                      0.00    0.03   32454/460210      update_queue [99]
        -----------------------------------------------
                      0.00    0.03   32454/460210      sys_semtimedop [61]
                      0.06    0.36  427756/460210      semctl_main [75]
      [99]     0.4    0.06    0.39  460210         update_queue [99]
                      0.30    0.00 2798595/3301773     try_atomic_semop [109]
                      0.00    0.09  470630/614097      wake_up_process [146]
        -----------------------------------------------
                      0.05    0.00  503178/3301773     sys_semtimedop [61]
                      0.30    0.00 2798595/3301773     update_queue [99]
      [109]    0.3    0.35    0.00 3301773         try_atomic_semop [109]
        -----------------------------------------------
      
      The number of calls to both try_atomic_semop() and update_queue() is
      reduced by about 50% as a result of the merge.  The execution time of
      sys_semtimedop() is reduced because of the reduction in these low-level
      functions.
    • [PATCH] PCI domain scanning fix · d8d90b60
      Andrew Morton authored
      From: Matthew Wilcox <willy@debian.org>
      
      ppc64 oopses on boot because pci_scan_bus_parented() is unexpectedly
      returning NULL.  Change pci_scan_bus_parented() to correctly handle
      overlapping PCI bus numbers on different domains.
    • Merge bk://ppc.bkbits.net/for-linus-ppc · b40585d0
      Linus Torvalds authored
      into home.osdl.org:/home/torvalds/v2.5/linux
    • Merge samba.org:/home/paulus/kernel/linux-2.5 · b727fa42
      Paul Mackerras authored
      into samba.org:/home/paulus/kernel/for-linus-ppc
  2. 04 Jul, 2003 15 commits
    • [PATCH] wrong pid in siginfo_t · c7aa953c
      Ulrich Drepper authored
      If a signal is sent via kill() or tkill() the kernel fills in the wrong
      PID value in the siginfo_t structure (obviously only if the handler has
      SA_SIGINFO set).
      
      POSIX specifies that the si_pid field is filled with the process ID, and
      in Linux parlance that's the "thread group" ID, not the thread ID.
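
      In the kill()/tkill() paths this boils down to filling the siginfo with the
      thread group ID, roughly (a fragment; the surrounding code in
      kernel/signal.c is abbreviated):

      	struct siginfo info;

      	info.si_signo = sig;
      	info.si_errno = 0;
      	info.si_code = SI_USER;
      	info.si_pid = current->tgid;	/* the process ID (thread group ID),
      					 * not current->pid (the thread ID) */
      	info.si_uid = current->uid;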
    • When forcing through a signal for some thread-synchronous · 9e008c3c
      Linus Torvalds authored
      event (ie SIGSEGV, SIGFPE etc that happens as a result of a
      trap as opposed to an external event), if the signal is
      blocked we will not invoke a signal handler, we will just
      kill the thread with the signal.
      
      This is equivalent to what we do in the SIG_IGN case: you
      cannot ignore or block synchronous signals, and if you try,
      we'll just have to kill you.
      
      We don't want to handle endless recursive faults, which the
      old behaviour easily led to if the stack was bad, for example.
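
      A rough sketch of the delivery-path change (approximate; helper names follow
      kernel/signal.c conventions but the details differ from the actual code):

      #include <linux/sched.h>
      #include <linux/signal.h>

      int force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
      {
      	unsigned long flags;
      	int ret;

      	spin_lock_irqsave(&t->sighand->siglock, flags);
      	if (t->sighand->action[sig - 1].sa.sa_handler == SIG_IGN ||
      	    sigismember(&t->blocked, sig)) {
      		/* You cannot ignore or block a thread-synchronous signal:
      		 * revert to the default action, which for SIGSEGV, SIGFPE
      		 * etc. kills the thread rather than recursing on the fault. */
      		t->sighand->action[sig - 1].sa.sa_handler = SIG_DFL;
      		sigdelset(&t->blocked, sig);
      	}
      	ret = specific_send_sig_info(sig, info, t);
      	spin_unlock_irqrestore(&t->sighand->siglock, flags);

      	return ret;
      }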
    • Go back to defaulting to 6-byte commands for MODE SENSE, · b79c8524
      Linus Torvalds authored
      since some drivers seem to be unhappy about the 10-byte
      version. 
      
      The subsystem configuration can override this (eg USB or
      ide-scsi).
    • [PATCH] EISA: avoid unnecessary probing · c4404d65
      Marc Zyngier authored
      - By default, do not try to probe the bus if the mainboard does not
        seem to support EISA (allow this behaviour to be changed through a
        command-line option).
    • [PATCH] EISA: PCI-EISA dma_mask · e34121f7
      Marc Zyngier authored
      - Use parent bridge device dma_mask as default for each discovered
        device.
    • [PATCH] EISA: PA-RISC changes · e0e5907e
      Marc Zyngier authored
      - Probe the right number of EISA slots on PA-RISC. No more, no less.
    • [PATCH] EISA: More EISA ids · 5fe1dbf4
      Marc Zyngier authored
    • [PATCH] EISA: Documentation update · d8d9c9e8
      Marc Zyngier authored
    • [PATCH] EISA: core changes · ddb6ee51
      Marc Zyngier authored
      - Now reserves I/O ranges according to the EISA specs (four 256-byte
        regions instead of a single 4KB region).
      
      - By default, do not try to probe the bus if the mainboard does not
        seem to support EISA (allow this behaviour to be changed through a
        command-line option).
      
      - Use parent bridge device dma_mask as default for each discovered
        device.
      
      - Allow devices to be enabled or disabled from the kernel command line
        (useful for non-x86 platforms where the firmware simply disables
        devices it doesn't know about...).
    • Merge bk://kernel.bkbits.net/jgarzik/irda-2.5 · ee389f0a
      Linus Torvalds authored
      into home.osdl.org:/home/torvalds/v2.5/linux
    • Carl-Daniel Hailfinger suggests adding a paranoid incoming · 87d890b8
      Linus Torvalds authored
      trigger as per the "bk help triggers" suggestion, so that
      we'll see any new triggers showing up in the tree.
      
      Make it so.
    • [PATCH] Use the intents in 'nameidata' to improve NFS close-to-open consistency · 52d1430d
      Trond Myklebust authored
        - Make use of the open intents to improve close-to-open
          cache consistency. Only force data cache revalidation when
          we're doing an open().
      
        - Add true exclusive create to NFSv3.
      
        - Optimize away the redundant ->lookup() to check for an
          existing file when we know that we're doing NFSv3 exclusive
          create.
      
        - Optimize away all ->permission() checks other than those for
          path traversal, open(), and sys_access().
    • [PATCH] Pass 'nameidata' to ->permission() · a574f324
      Trond Myklebust authored
         - Make the VFS pass the struct nameidata as an optional parameter
           to the permission() inode operation.
      
         - Patch may_create()/may_open() so it passes the struct nameidata from
           vfs_create()/open_namei() as an argument to permission().
      
         - Add an intent flag for the sys_access() function.
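
      The shape of the change is roughly as follows (a sketch; LOOKUP_ACCESS and
      the exact prototypes are assumptions based on the description above, and the
      struct fragment is abbreviated):

      /* Fragment: ->permission() gains an optional nameidata; callers that have
       * no lookup context simply pass NULL. */
      struct inode_operations {
      	/* ... */
      	int (*permission)(struct inode *inode, int mask, struct nameidata *nd);
      };

      /* may_create()/may_open() forward the nameidata they already hold: */
      static inline int may_create(struct inode *dir, struct dentry *child,
      			     struct nameidata *nd)
      {
      	if (child->d_inode)
      		return -EEXIST;
      	if (IS_DEADDIR(dir))
      		return -ENOENT;
      	return permission(dir, MAY_WRITE | MAY_EXEC, nd);
      }

      /* sys_access() marks its lookup with an intent flag (e.g. LOOKUP_ACCESS in
       * nd.flags) so filesystems such as NFS can skip checks they don't need. */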
    • [PATCH] Pass 'nameidata' to ->create() · 675b5da0
      Trond Myklebust authored
        - Make the VFS pass the struct nameidata as an optional argument
          to the create inode operation.
        - Patch vfs_create() to take a struct nameidata as an optional
          argument.
    • [PATCH] Add open intent information to the 'struct nameidata' · fc8b427e
      Trond Myklebust authored
       - Add open intent information to the 'struct nameidata'.
       - Pass the struct nameidata as an optional parameter to the
         lookup() inode operation.
       - Pass the struct nameidata as an optional parameter to the
         d_revalidate() dentry operation.
       - Make link_path_walk() set the LOOKUP_CONTINUE flag in nd->flags instead
         of passing it as an extra parameter to d_revalidate().
       - Make open_namei() and sys_uselib() set the open()/create() intent
         data.
  3. 03 Jul, 2003 8 commits