1. 06 Feb, 2003 40 commits
    • Mark Haverkamp's avatar
      [PATCH] fix megaraid driver compile error · 3fa327f8
      Mark Haverkamp authored
      This moves access of the host element to device since host has been
      removed from struct scsi_cmnd.
      3fa327f8
    • Ingo Molnar's avatar
      [PATCH] signal-fixes-2.5.59-A4 · ebf5ebe3
      Ingo Molnar authored
      this is the current threading patchset, which accumulated during the
      past two weeks. It consists of a big set of changes from Roland, to
      make threaded signals work. There were still tons of testcases and
      boundary conditions (mostly in the signal/exit/ptrace area) that we did
      not handle correctly.
      
      Roland's thread-signal semantics/behavior/ptrace fixes:
      
       - fix signal delivery race with do_exit() => signals are re-queued to the
         'process' if do_exit() finds pending unhandled ones. This prevents
         signals getting lost upon thread-sys_exit().
      
       - a non-main thread has died on one processor and gone to TASK_ZOMBIE,
         but before it's gotten to release_task a sys_wait4 on the other
         processor reaps it.  It's only because it's ptraced that this gets
         through eligible_child.  Somewhere in there the main thread is also
         dying so it reparents the child thread to hit that case.  This means
         that there is a race where P might be totally invalid.
      
       - forget_original_parent is not doing the right thing when the group
         leader dies, i.e. reparenting threads to init when there is a zombie
         group leader.  Perhaps it doesn't matter for any practical purpose
         without ptrace, though it makes for ppid=1 for each thread in core
         dumps, which looks funny. Incidentally, SIGCHLD here really should be
         p->exit_signal.
      
       - one of the gdb tests makes a questionable assumption about what kill
         will do when it has some threads stopped by ptrace and others running.
      
      exit races:
      
      1. Processor A is in sys_wait4 case TASK_STOPPED considering task P.
         Processor B is about to resume P and then switch to it.
      
         While A is inside that case block, B starts running P and it clears
         P->exit_code, or takes a pending fatal signal and sets it to a new
         value. Depending on the interleaving, the possible failure modes are:
              a. A gets to its put_user after B has cleared P->exit_code
                 => returns with WIFSTOPPED, WSTOPSIG==0
              b. A gets to its put_user after B has set P->exit_code anew
                 => returns with e.g. WIFSTOPPED, WSTOPSIG==SIGKILL
      
         A can spend an arbitrarily long time in that case block, because
         there's getrusage and put_user that can take page faults, and
         write_lock'ing of the tasklist_lock that can block.  But even if it's
         short the race is there in principle.
      
      2. This is new with NPTL, i.e. CLONE_THREAD.
         Two processors A and B are both in sys_wait4 case TASK_STOPPED
         considering task P.
      
         Both get through their tests and fetches of P->exit_code before either
         gets to P->exit_code = 0.  => two threads return the same pid from
         waitpid.
      
         In other interleavings where one processor gets to its put_user after
         the other has cleared P->exit_code, it's like case 1(a).
      
      
      3. SMP races with stop/cont signals
      
         First, take:
      
              kill(pid, SIGSTOP);
              kill(pid, SIGCONT);
      
         or:
      
              kill(pid, SIGSTOP);
              kill(pid, SIGKILL);
      
         It's possible for this to leave the process stopped with a pending
         SIGCONT/SIGKILL.  That's a state that should never be possible.
         Moreover, kill(pid, SIGKILL) without any repetition should always be
         enough to kill a process.  (Likewise SIGCONT when you know it's
         sequenced after the last stop signal, must be sufficient to resume a
         process.)
      
      4. take:
      
              kill(pid, SIGKILL);     // or any fatal signal
              kill(pid, SIGCONT);     // or SIGKILL
      
          it's possible for this to cause pid to be reaped with status 0
          instead of its true termination status.  The equivalent scenario
          happens when the process being killed is in an _exit call or a
          trap-induced fatal signal before the kills.
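
      Failure modes 1(a) and 1(b) above are easier to see once the wait
      status is written out. The sketch below assumes the usual Linux
      encoding of a "stopped" status (low byte 0x7f, stop signal in the next
      byte, taken from P->exit_code); stopped_status() and
      decode_stop_signal() are hypothetical helpers for illustration, not
      kernel functions.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Assumed Linux encoding of a "stopped" wait status:
 * low byte 0x7f, stop signal (from exit_code) in the next byte. */
static int stopped_status(int exit_code)
{
    return (exit_code << 8) | 0x7f;
}

/* Returns the stop signal a waiter would decode from status,
 * or -1 if the status does not describe a stopped child. */
static int decode_stop_signal(int status)
{
    if (!WIFSTOPPED(status))
        return -1;
    return WSTOPSIG(status);
}
```

      With exit_code already cleared to 0, the waiter still sees WIFSTOPPED
      but decodes stop signal 0 (case a). With exit_code overwritten by a
      fatal signal such as SIGKILL, it decodes that signal as if it were a
      stop signal (case b).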
      
      plus i've done stability fixes for bugs that popped up during
      beta-testing, and minor tidying of Roland's changes:
      
       - a rare tasklist corruption during exec, causing some very spurious and
         colorful crashes.
      
       - a copy_process()-related dereference of already freed thread structure
         if hit with a SIGKILL in the wrong moment.
      
       - SMP spinlock deadlocks in the signal code
      
      this patchset has been tested quite well in the 2.4 backport of the
      threading changes - and i've done some stresstesting on 2.5.59 SMP as
      well, and did an x86 UP testcompile + testboot as well.
      ebf5ebe3
    • David Jeffery's avatar
      [PATCH] ips driver 4/4: error messages · 44a5a59c
      David Jeffery authored
      This small patch does 2 things.  It reworks the firmware/driver
      versioning messages to make them more understandable, and it
      fixes one case where the 64bit addressing changes caused
      error/success to not be properly reported to the serveraid tools.
      44a5a59c
    • David Jeffery's avatar
      [PATCH] ips driver 3/4: 64bit dma addressing · d31bb16c
      David Jeffery authored
      This large patch adds support for using 64bit addressing.
      
      Special thanks goes to Mike Anderson who did the initial
      versions of this patch.
      d31bb16c
    • David Jeffery's avatar
      [PATCH] ips driver 2/4: initialization reordering · 836f40cb
      David Jeffery authored
      This large patch reworks much of the adapter initialization
      code.
      
      It splits the scsi initialization code from the pci
      initialization.  It adds support for working with some
      future cards.  It also removes the use of multiple pci_driver
      registrations and instead does its own adapter ordering.
      836f40cb
    • David Jeffery's avatar
      [PATCH] ips driver 1/4: fix struct length and remove dead code · 9d252c21
      David Jeffery authored
      This small patch fixes the length of the IPS_ENQ
      struct.  It was too short which can cause the adapter
      to write beyond the end of the struct during
      driver initialization and corrupt part of memory.
      9d252c21
    • Linus Torvalds's avatar
      Merge http://linux-scsi.bkbits.net/scsi-for-linus-2.5 · bd15d114
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
      bd15d114
    • James Bottomley's avatar
      Merge raven.il.steeleye.com:/home/jejb/BK/scsi-misc-2.5 · 35766eb7
      James Bottomley authored
      into raven.il.steeleye.com:/home/jejb/BK/scsi-for-linus-2.5
      35766eb7
    • Christoph Hellwig's avatar
      [PATCH] coding style updates for scsi_lib.c · 78ef52ec
      Christoph Hellwig authored
      I just couldn't see the mess anymore..  Nuke the ifdefs and use sane
      variable names.  Some more small nitpicks but no behaviour changes at
      all.
      78ef52ec
    • Rusty Russell's avatar
      [PATCH] 2.5.59 add two help texts to drivers_scsi_Kconfig · baaf76dd
      Rusty Russell authored
      From:  Steven Cole <elenstev@mesatop.com>
      
        Here are some help texts from 2.4.21-pre3 Configure.help which are
        needed in 2.5.59 drivers/scsi/Kconfig.
      
        Steven
      baaf76dd
    • Rusty Russell's avatar
      [PATCH] [patch, 2.5] scsi_qla1280.c free on error path · f8646d20
      Rusty Russell authored
      From:  Marcus Alanen <maalanen@ra.abo.fi>
      
        Remove check_region in favour of request_region. Free resources
        properly on error path. Horribly subtle ioremap/iounmap lurks here I
        think, in qla1280_pci_config(), which the below patch should take care
        of.
      
        I'm wondering if there couldn't / shouldn't be a better way to
        allocate resources. Obviously lots of drivers have broken error paths.
        Is this even necessary?
      
        Marcus
      
      
        #
        # create_patch: qla1280_release_on_error_path-2002-12-08-A.patch
        # Date: Sun Dec  8 22:32:33 EET 2002
        #
      f8646d20
    • Christoph Hellwig's avatar
      [SCSI] Remove host_active · d30a24be
      Christoph Hellwig authored
      It isn't used anywhere anymore
      d30a24be
    • Linus Torvalds's avatar
      Merge http://linux-acpi.bkbits.net/linux-acpi · 26d987a7
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
      26d987a7
    • Andy Grover's avatar
      ACPI: Enable compilation w/o cpufreq · 32dbc81b
      Andy Grover authored
      32dbc81b
    • Randy Dunlap's avatar
      [PATCH] quota memleak · a2dd1464
      Randy Dunlap authored
      The Stanford Checker found a memleak.
      a2dd1464
    • Linus Torvalds's avatar
      Merge bk://kernel.bkbits.net/vojtech/x86-64 · d0d3f1f0
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
      d0d3f1f0
    • Vojtech Pavlik's avatar
    • Andrew Morton's avatar
      [PATCH] Fix signed use of i_blocks in ext3 truncate · 9a3e1a96
      Andrew Morton authored
      Patch from "Stephen C. Tweedie" <sct@redhat.com>
      
      Fix "h_buffer_credits<0" assert failure during truncate.
      
      The bug occurs when the "i_blocks" count in the file's inode overflows
      past 2^31.  That works fine most of the time, because i_blocks is an
      unsigned long, and should go up to 2^32; but there's a place in truncate
      where ext3 calculates the size of the next transaction chunk for the
      delete, and that mistakenly uses a signed long instead.  Because the
      huge i_blocks gets cast to a negative value, ext3 does not reserve
      enough credits for the transaction and the above error results.
      
      This is usually only possible on filesystems corrupted for other
      reasons, but it is reproducible if you create a single, non-sparse file
      larger than 1TB on ext3 and then try to delete it.
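
      The sign problem can be reproduced in isolation. This is a minimal
      sketch, not the ext3 code: fixed-width types stand in for a 32-bit
      unsigned long, and both helper names are hypothetical. Since i_blocks
      counts 512-byte units, 2^31 of them correspond to the 1TB file size
      mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy variant: the block count passes through a signed 32-bit type.
 * The out-of-range conversion is implementation-defined in C; on common
 * compilers it wraps to a negative value. */
static int64_t blocks_via_signed(uint32_t i_blocks)
{
    int32_t as_signed = (int32_t)i_blocks;
    return as_signed;
}

/* Fixed variant: the count stays unsigned until it is widened. */
static int64_t blocks_via_unsigned(uint32_t i_blocks)
{
    return (int64_t)i_blocks;
}
```

      For a count just past 2^31, the first helper yields a negative number
      (so too few credits would be reserved), while the second keeps the
      true value.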
      9a3e1a96
    • Andrew Morton's avatar
      [PATCH] CPU Hotplug mm/slab.c CPU_UP_CANCELED fix · 4f1cb3ff
      Andrew Morton authored
      Patch from Manfred Spraul.
      
      Fixes a bug which was exposed by Zwane's hotplug CPU work.  The
      cache_cache.array pointer is initially given a temp bootstrap area, which is
      later converted over to the final value after the CPU is brought up.
      
      But if slab is enhanced to permit cancellation of a CPU bringup, this pointer
      ends up pointing at stale memory.  So reinitialise it by hand when
      kmem_cache_init() is run.
      4f1cb3ff
    • Andrew Morton's avatar
      [PATCH] spinlock debugging on uniprocessors · ecd2d220
      Andrew Morton authored
      Patch from Manfred Spraul <manfred@colorfullife.com>
      
      This enables spinlock debugging on uniprocessor builds, under
      CONFIG_DEBUG_SPINLOCK.
      
      The reason I want this is that one day we'll need to pull out the debugging
      support from the timer code which detects uninitialised timers.  And once
      that has gone, uniprocessor developers and testers have no way of detecting
      uninitialised timers - there will be mysterious deadlocks on SMP machines.
      And there will surely be more uninitialised timers.
      
      The patch also removes the last pieces of the support for including
      <asm/spinlock.h> directly.  Doesn't work since (IIRC) 2.3.x
      ecd2d220
    • Andrew Morton's avatar
      [PATCH] mm/mremap.c whitespace cleanup · 32738fbf
      Andrew Morton authored
      - Not everyone uses 160-column xterms.
      
      - Coding style consistency
      32738fbf
    • Andrew Morton's avatar
      [PATCH] hugetlb mremap fix · df79ea40
      Andrew Morton authored
      If you attempt to perform a relocating 4k-aligned mremap and the new address
      for the map lands on top of a hugepage VMA, do_mremap() will attempt to
      perform a 4k-aligned unmap inside the hugetlb VMA.  The hugetlb layer goes
      BUG.
      
      Fix that by trapping the poorly-aligned unmap attempt in do_munmap().
      do_mremap() will then fall through without having done anything to the place
      where it tests for a hugetlb VMA.
      
      It would be neater to perform these checks on entry to do_mremap(), but that
      would incur another VMA lookup.
      
      Also, if you attempt to perform a 4k-aligned and/or sized munmap() inside a
      hugepage VMA the same BUG happens.  This patch fixes that too.
      
      This all means that an mremap attempt against a hugetlb area will fail, but
      only after having unmapped the source pages.  That's a bit messy, but
      supporting hugetlb mremap doesn't seem worth it, and completely disallowing
      it will add overhead to normal mremaps.
      df79ea40
    • Andrew Morton's avatar
      [PATCH] Fix hugetlb_vmtruncate_list() · 8a1335e9
      Andrew Morton authored
      This function is quite wrong - has an "=" where it should have a "-" and
      confuses PAGE_SIZE and HPAGE_SIZE in its address and file offset arithmetic.
      8a1335e9
    • Andrew Morton's avatar
      [PATCH] ia32 hugetlb cleanup · a20d5200
      Andrew Morton authored
      - whitespace
      
      - remove unneeded spinlocking no-op.
      a20d5200
    • Andrew Morton's avatar
      [PATCH] Fix hugetlbfs faults · 8b5111ec
      Andrew Morton authored
      If the underlying mapping was truncated and someone references the
      now-unmapped memory the kernel will enter handle_mm_fault() and will start
      instantiating PAGE_SIZE pte's inside the hugepage VMA.  Everything goes
      generally pear-shaped.
      
      So trap this in handle_mm_fault().  It adds no overhead to non-hugepage
      builds.
      
      Another possible fix would be to not unmap the huge pages at all in truncate
      - just anonymise them.
      
      But I think we want full ftruncate semantics for hugepages for management
      purposes.
      8b5111ec
    • Andrew Morton's avatar
      [PATCH] Give all architectures a hugetlb_nopage(). · 08a1cc4e
      Andrew Morton authored
      If someone maps a hugetlbfs file, then truncates it, then references the part
      of the mapping outside the truncation point, they take a pagefault and we end
      up hitting hugetlb_nopage().
      
      We want to prevent this from ever happening.  This patch just makes sure that
      all architectures have a goes-BUG hugetlb_nopage() to trap it.
      08a1cc4e
    • Andrew Morton's avatar
      [PATCH] hugetlbfs cleanups · 3cc33271
      Andrew Morton authored
      - Remove quota code.
      
      - Remove extraneous copy-n-paste code from truncate: that's only for
        physically-backed filesystems.
      
      - Whitespace changes.
      3cc33271
    • Andrew Morton's avatar
      [PATCH] hugetlbfs i_size fixes · 05732657
      Andrew Morton authored
      We're expanding hugetlbfs i_size in the wrong place.  If someone attempts to
      mmap more pages than are available, i_size is updated to reflect the
      attempted mapping size.
      
      So set i_size only when pages are successfully added to the mapping.
      
      i_size handling at truncate time is still a bit wrong - if the mapping has
      pages at (say) page offset 100-200 and the mapping is truncated to (say) page
      offset 50, i_size should be set to zero.  But it is instead set to
      50*HPAGE_SIZE.  That's harmless.
      05732657
    • Andrew Morton's avatar
      [PATCH] hugetlbfs: fix truncate · 136963d1
      Andrew Morton authored
      - Opening a hugetlbfs file O_TRUNC calls the generic vmtruncate() functions
        and nukes the kernel.
      
        Give S_ISREG hugetlbfs files an inode_operations, and hence a setattr
        which knows how to handle these files.
      
      - Don't permit the user to truncate hugetlbfs files to sizes which are not
        a multiple of HPAGE_SIZE.
      
      - We don't support expanding in ftruncate(), so remove that code.
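
      The alignment rule in the second item amounts to a mask test. A
      minimal sketch, assuming a 4 MB huge page (HPAGE_SHIFT of 22; the real
      value is architecture-dependent) and a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 MB huge pages. The kernel takes these from arch headers. */
#define HPAGE_SHIFT 22
#define HPAGE_SIZE  ((uint64_t)1 << HPAGE_SHIFT)
#define HPAGE_MASK  (~(HPAGE_SIZE - 1))

/* Hypothetical check: a truncate size is acceptable only if it is a
 * whole number of huge pages, i.e. no bits set below HPAGE_SHIFT. */
static bool hugetlb_truncate_size_ok(uint64_t new_size)
{
    return (new_size & ~HPAGE_MASK) == 0;
}
```

      A size of 2*HPAGE_SIZE passes, while HPAGE_SIZE + 4096 would be
      rejected.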
      136963d1
    • Andrew Morton's avatar
      [PATCH] get_unmapped_area for hugetlbfs · 8ca8cd5b
      Andrew Morton authored
      Having to specify the mapping address is a pain.  Give hugetlbfs files a
      file_operations.get_unmapped_area().
      
      The implementation is in hugetlbfs rather than in arch code because it's
      probably common to several architectures.  If the architecture has special
      needs it can define HAVE_ARCH_HUGETLB_UNMAPPED_AREA and go it alone.  Just
      like HAVE_ARCH_UNMAPPED_AREA.
      8ca8cd5b
    • Andrew Morton's avatar
      [PATCH] convert hugetlb code to use compound pages · b3a656b6
      Andrew Morton authored
      The odd thing about hugetlb is that it maintains its own freelist of pages.
      And it has to do that, else it would trivially run out of pages due to buddy
      fragmentation.
      
      So we don't want callers of put_page() to be passing those pages
      to __free_pages_ok() on the final put().
      
      So hugetlb installs a destructor in the compound pages to point at
      free_huge_page(), which knows how to put these pages back onto the free list.
      
      Also, don't mark hugepages as all PageReserved any more.  That's preventing
      callers from doing proper refcounting.  Any code which does a user pagetable
      walk and hits part of a hugepage will now handle it transparently.
      b3a656b6
    • Andrew Morton's avatar
      [PATCH] Infrastructure for correct hugepage refcounting · eefb08ee
      Andrew Morton authored
      We currently have a problem when things like ptrace, futexes and direct-io
      try to pin user pages.  If the user's address is in a huge page we're
      elevating the refcount of a constituent 4k page, not the head page of the
      high-order allocation unit.
      
      To solve this, a generic way of handling higher-order pages has been
      implemented:
      
      - A higher-order page is called a "compound page".  Chose this because
        "huge page", "large page", "super page", etc all seem to mean different
        things to different people.
      
      - The first (controlling) 4k page of a compound page is referred to as the
        "head" page.
      
      - The remaining pages are tail pages.
      
      All pages have PG_compound set.  All pages have their lru.next pointing at
      the head page (even the head page has this).
      
      The head page's lru.prev, if non-zero, holds the address of the compound
      page's put_page() function.
      
      The order of the allocation is stored in the first tail page's lru.prev.
      This is only for debug at present.  This usage means that zero-order pages
      may not be compound.
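
      The layout described above can be modelled in user space. This is a
      simplified sketch, not the kernel's struct page: lru.next is modelled
      as a typed head pointer, lru.prev as a union holding either the
      destructor (head page) or the order (first tail page), and
      record_free() is a stand-in destructor invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct page;
typedef void (*page_dtor_t)(struct page *head);

struct page {
    int compound;               /* stands in for PG_compound */
    int count;                  /* reference count */
    struct page *lru_next;      /* models lru.next: always the head page */
    union {
        page_dtor_t dtor;       /* head page: optional put_page() hook */
        unsigned long order;    /* first tail page: allocation order */
    } lru_prev;                 /* models lru.prev */
};

static int freed_count;         /* bumped by the stand-in destructor */

static void record_free(struct page *head)
{
    (void)head;
    freed_count++;
}

/* Link 1 << order contiguous pages into one compound page. */
static void prep_compound_page(struct page *pages, unsigned long order,
                               page_dtor_t dtor)
{
    unsigned long i, nr = 1UL << order;

    assert(order > 0);          /* zero-order pages may not be compound */
    for (i = 0; i < nr; i++) {
        pages[i].compound = 1;
        pages[i].lru_next = &pages[0];
    }
    pages[0].lru_prev.dtor = dtor;
    pages[1].lru_prev.order = order;
}

static struct page *compound_head(struct page *p)
{
    return p->compound ? p->lru_next : p;
}

/* Pinning any constituent page elevates the head page's count. */
static void get_page(struct page *p)
{
    compound_head(p)->count++;
}

/* The final put on a compound page goes to the destructor, if any. */
static void put_page(struct page *p)
{
    struct page *head = compound_head(p);

    if (--head->count == 0 && head->compound && head->lru_prev.dtor)
        head->lru_prev.dtor(head);
}
```

      Taking a reference through a tail page and dropping it through a
      different tail page both act on the head page's count, which is the
      property the pinning code paths need.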
      
      The above relationships are established for _all_ higher-order pages in the
      page allocator.  Which has some cost, but not much - another atomic op during
      fork(), mainly.
      
      This functionality is only enabled if CONFIG_HUGETLB_PAGE, although it could
      be turned on permanently.  There's a little extra cost in get_page/put_page.
      
      These changes do not preclude adding compound pages to the LRU in the future
      - we can add a new page flag to the head page and then move all the
      additional data to the first tail page's lru.next, lru.prev, list.next,
      list.prev, index, private, etc.
      eefb08ee
    • Andrew Morton's avatar
      [PATCH] give hugetlbfs a set_page_dirty a_op · 6725839b
      Andrew Morton authored
      Seems that nobody has tested direct IO into hugetlb pages yet.  The VFS gets
      upset about running set_page_dirty() against a non-uptodate page.
      
      So give hugetlbfs inodes a private no-op ->set_page_dirty() to isolate them
      from all that.
      6725839b
    • Andrew Morton's avatar
      [PATCH] pte_chain_alloc fixes · afcde6ef
      Andrew Morton authored
      There are several places in which the return value from pte_chain_alloc() is
      not being checked, and one place in which a GFP_KERNEL allocation is
      happening inside a spinlock.
      afcde6ef
    • Andrew Morton's avatar
      [PATCH] loop inefficiency fix · a1329fe8
      Andrew Morton authored
      Patch from Hugh Dickins <hugh@veritas.com>
      
      The loop driver's loop over elements of bi_io_vec is in lo_send and
      lo_receive: iterating that same transfer bi_vcnt times at the level above is,
      er, excessive.  (And no need to increment bi_idx here.)
      a1329fe8
    • Andrew Morton's avatar
      [PATCH] default_idle micro-optimisation · 87afb5f6
      Andrew Morton authored
      Patch from rwhron@earthlink.net
      
      Micro-optimization of default_idle from -aa.  current_cpu_data.hlt_works_ok
      is only false for some old 386/486 pcs.
      87afb5f6
    • Andrew Morton's avatar
      [PATCH] Optimise follow_page() for page-table-based hugepages · 1f1921fc
      Andrew Morton authored
      ia32 and others can determine a page's hugeness by inspecting the pmd's value
      directly.  No need to perform a VMA lookup against the user's virtual
      address.
      
      This patch ifdef's away the VMA-based implementation of
      hugepage-aware-follow_page for ia32 and replaces it with a pmd-based
      implementation.
      
      The intent is that architectures will implement one or the other.  So the
      architecture either:
      
      1: Implements hugepage_vma()/follow_huge_addr(), and stubs out
         pmd_huge()/follow_huge_pmd() or
      
      2: Implements pmd_huge()/follow_huge_pmd(), and stubs out
         hugepage_vma()/follow_huge_addr()
      1f1921fc
    • Andrew Morton's avatar
      [PATCH] Fix futexes in huge pages · f93fcfa9
      Andrew Morton authored
      Using a futex in a large page causes a kernel lockup in __pin_page() -
      because __pin_page's page revalidation uses follow_page(), and follow_page()
      doesn't work for hugepages.
      
      The patch fixes up follow_page() to return the appropriate 4k page for
      hugepages.
      
      This incurs a vma lookup for each follow_page(), which is considerable
      overhead in some situations.  We only _need_ to do this if the architecture
      cannot determine a page's hugeness from the contents of the PMD.
      
      So this patch is a "reference" implementation for, say, PPC BAT-based
      hugepages.
      f93fcfa9
    • Andrew Morton's avatar
      [PATCH] ia32 IRQ distribution rework · 08f16f8f
      Andrew Morton authored
      Patch from "Kamble, Nitin A" <nitin.a.kamble@intel.com>
      
      Hello All,
      
        We were looking at the performance impact of the IRQ routing from
      the 2.5.52 Linux kernel. This email includes some of our findings
      about the way the interrupts are getting moved in the 2.5.52 kernel.
      Also there is discussion and a patch for a new implementation. Let
      me know what you think at nitin.a.kamble@intel.com
      
      Current implementation:
      ======================
      We have found that the existing implementation works well on IA32
      SMP systems with light load of interrupts. Also we noticed that it
      is not working that well under heavy interrupt load conditions on
      these SMP systems. The observations are:
      
      * Interrupt load of each IRQ is getting balanced on CPUs independent
      of load of other IRQs. Also the current implementation moves the
      IRQs randomly. This works well when the interrupt load is light. But
      we start seeing imbalance of interrupt load with existence of
      multiple heavy interrupt sources. Frequently multiple heavily loaded
      IRQs get moved to a single CPU while other CPUs stay very lightly
      loaded. To achieve a good interrupts load balance, it is important to
      consider the load of all the interrupts together.
          This further can be explained with an example of 4 CPUs and 4
      heavy interrupt sources. With the existing random movement approach,
      the chance of each of these heavy interrupt sources moving to separate
      CPUs is: (4/4)*(3/4)*(2/4)*(1/4) = 3/32. It means 29/32 = 90.6% of
      the time the situation is, some CPUs are very lightly loaded and some
      are loaded with multiple heavy interrupts. This causes the interrupt
      load imbalance and results in less performance. In a case of 2 CPUs
      and 2 heavily loaded interrupt sources, this imbalance happens
      1/2 = 50% of the times. This issue becomes more and more severe with
      increasing number of heavy interrupt sources.
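
      The probability used in this example generalises to n!/n^n for n heavy
      sources placed uniformly and independently on n CPUs. A small sketch
      of that arithmetic (the function name is hypothetical, and uniform
      independent placement is an assumption about the "random movement"
      described above):

```c
#include <assert.h>

/* Probability that n interrupt sources, each placed on one of n CPUs
 * uniformly and independently, all end up on distinct CPUs:
 * (n/n) * ((n-1)/n) * ... * (1/n) = n! / n^n. */
static double prob_all_separate(unsigned int n)
{
    double p = 1.0;
    unsigned int k;

    for (k = 0; k < n; k++)
        p *= (double)(n - k) / (double)n;
    return p;
}
```

      For n = 2 this gives 1/2, matching the 50% figure for two CPUs. For
      n = 4 the product is 24/256 = 3/32, so a perfectly spread placement is
      the exception rather than the rule, and it gets rarer as n grows.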
      
      * Another interesting observation is: We cannot see the imbalance
      of the interrupt load from /proc/interrupts. (/proc/interrupts shows
      the cumulative load of interrupts on all CPUs.) If the interrupt load
      is imbalanced and this imbalance is getting rotated among CPUs
      continuously, then /proc/interrupts will still show that the interrupt
      load is going to processors very evenly. Currently at the frequency
      (HZ/50) at which IRQs are moved across CPUs, it is not possible to
      see any interrupt load imbalance happening.
      
      * We have also found that, in certain cases the static IRQ binding
      performs better than the existing kernel distribution of interrupt
      load. The reason is, in a well-balanced interrupt load situations,
      these interrupts are unnecessarily getting frequently moved across
      CPUs. This adds an extra overhead; also it takes off the CPU cache
      warmth benefits.
        This came out from the performance measurements done on a 4-way HT
      (8 logical processors) Pentium 4 Xeon system running 8 copies of
      netperf. The 4 NICs in the system taking different IRQs generated
      sizable interrupt load with the help of connected clients.
      
      Here the netperf transactions/sec throughput numbers observed are:
      
      IRQs nicely manually bound to CPUs: 56.20K
      The current kernel implementation of IRQ movement: 50.05K
       -----------------------
       The static binding of IRQs has performed 12.28% better than the
      current IRQ movement implemented in the kernel.
      
      * The current implementation does not distinguish siblings from the
      HT (Hyper-Threading(tm)) enabled CPUs. It will be beneficial to
      balance the interrupt load with respect to processor packages first,
      and then among logical CPUs inside processor packages.
        For example if we have 2 heavy interrupt sources and 2 processor
      packages (4 logical CPUs); Assigning both the heavy interrupt sources
      in different processor packages is better, it will use different
      execution resources from the different processor packages.
      
      
      
      New revised implementation:
      ==========================
      We also have been working on a new implementation. The following
      points are in main focus.
      
      * At any moment heavily loaded IRQs are distributed to different
      CPUs to achieve as much balance as possible.
      
      * Lightly loaded interrupt sources are ignored from the load
      balancing, as they do not cause considerable imbalance.
      
      * When the heavy interrupt sources are balanced, they are not moved
      around. This also helps in keeping the CPU caches warm.
      
      * It has been made HT aware. While distributing the load, the load
      on a processor package to which the logical CPUs belong is also
      considered.
      
      * In situations with few (fewer than num_cpus) heavy interrupt
      sources, it is not possible to balance them evenly. In such case
      the existing code has been reused to move the interrupts. The
      randomness from the original code has been removed.
      
      * The time interval for redistribution has been made flexible. It
      varies as the system interrupt load changes.
      
      * A new kernel_thread is introduced to do the load balancing
      calculations for all the interrupt sources. It keeps the balance_maps
      ready for interrupt handlers, keeping the overhead in the interrupt
      handling to minimum.
      
      * It allows the disabling of the IRQ distribution from the boot loader
      command line, if anybody wants to do it for any reason.
      
      * The algorithm also takes into account the static binding of
      interrupts to CPUs that user imposes from the
      /proc/irq/{n}/smp_affinity interface.
      
      
      Throughput numbers with the netperf setup for the new implementation:
      
      Current kernel IRQ balance implementation: 50.02K transactions/sec
      The new IRQ balance implementation: 56.01K transactions/sec
       ---------------------
        The performance improvement on P4 Xeon of 11.9% is observed.
      
      The new IRQ balance implementation also shows little performance
      improvement on P6 (Pentium II, III) systems.
      
      On a P6 system the netperf throughput numbers are:
      Current kernel IRQ balance implementation: 36.96K transactions/sec
      The new IRQ balance implementation: 37.65K transactions/sec
       ---------------------
      Here the performance improvement on P6 system of about 2% is observed.
      
      
       ---------------------
      
      Andrew Theurer <habanero@us.ibm.com> did some testing of this patch on a quad
      P4:
      
      
      I got a chance to run the NetBench benchmark with your patch on 2.5.54-mjb2
      kernel.  NetBench measures SMB/CIFS performance by using several SMB
      clients  (in this case 44 Windows 2000 systems), sending SMB requests to a
      Linux  server running Samba 2.2.3a+sendfile.  Result is in throughput,
      Mbps.   Generally the network traffic on the server is 60% recv, 40% tx.
      
      I believe we have very similar systems.  Mine is a 4 x 1.6 GHz, 1 MB L3 P4
      Xeon with 4 GB DDR memory (3.2 GB/sec I believe).  The chipset is "Summit".
       I also have more than one Intel e1000 adapter.
      
      I decided to run a few configurations, first with just one adapter, with
      and  without HT support in the kernel (acpi=off), then add another adapter
      and  test again with/without HT.
      
      Here are the results:
      
      4P, no HT, 1 x e1000, no kirq:	1214 Mbps, 4% idle
      4P, no HT, 1 x e1000, kirq:		1223 Mbps, 4% idle,		+0.74%
      
      I suppose we didn't see much of an improvement here because we never run
      into  the situation where more than one interrupt with a high rate is
      routed to a  single CPU on irq_balance.
      
      4P, HT, 1 x e1000, no kirq:	1214 Mbps, 25% idle
      4P, HT, 1 x e1000, kirq:	1220 Mbps, 30% idle,			+0.49%
      
      Again, not much of a difference just yet, but lots of idle time.  We may
      have  reached the limit at which one logical CPU can process interrupts for
      an  e1000 adapter.  There are other things I can probably do to help this,
      like  int delay, and NAPI, which I will get to eventually.
      
      4P, HT, 2 x e1000, no kirq:	1269 Mbps, 23% idle
      4P, HT, 2 x e1000, kirq:	1329 Mbps, 18% idle			+4.7%
      
      OK, almost 5% better!  Probably has to do with a couple of things; the fact
      that your code does not route two different interrupts to the same
      core/different logical cpus (quite obvious by looking at /proc/interrupts),
      and that more than one interrupt does not go to the same cpu if possible.
      I  suspect irq_balance did some of those [bad] things some of the time, and
      we  observed a bottleneck in int processing that was lower than with kirq.
      
      I don't think all of the idle time is because of an int processing
      bottleneck.   I'm just not sure what it is yet :)  Hopefully something will
      become obvious  to me...
      
      Overall I like the way it works, and I believe it can be tweaked to work
      with  NUMA when necessary.  I hope to have access to a specweb system on a
      NUMA box  soon, so we can verify that.
      08f16f8f
    • Andrew Morton's avatar
      [PATCH] Fix SMP race between __sync_single_inode and · 50d49a05
      Andrew Morton authored
      Patch from Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      
      there's an SMP race condition between __sync_single_inode (or __sync_one on
      2.4.20) and __mark_inode_dirty. __mark_inode_dirty doesn't take the inode
      spinlock. As we know -- unless you take a spinlock or use a barrier, the
      processor can change the order of instructions.
      
      CPU 1
      
      modify inode
      (but modifications are in cpu-local
      buffer and do not go to bus)
      
      calls
      __mark_inode_dirty
      it sees I_DIRTY and exits immediately
      					CPU 2
      					takes spinlock
      					calls __sync_single_inode
      					inode->i_state &= ~I_DIRTY
      					writes the inode (but does not see
      					modifications by CPU 1 yet)
      
      CPU 1 flushes its write buffer to the bus
      inode is already written, clean, modifications
      done by CPU1 are lost
      
      The easiest fix would be to move the test inside spinlock in
      __mark_inode_dirty; if you do not want to suffer from performance loss,
      use the attached patches that use memory barriers to ensure ordering of
      reads and writes.
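
      The ordering requirement can be sketched with C11 atomics. This is a
      user-space model of the idea, not the attached patches: struct
      inode_model and both functions are hypothetical, and the sequentially
      consistent fences stand in for whatever barriers the kernel code
      actually uses.

```c
#include <assert.h>
#include <stdatomic.h>

#define I_DIRTY 1u

struct inode_model {
    atomic_int  data;       /* stands in for the modified inode fields */
    atomic_uint i_state;    /* holds the I_DIRTY bit */
};

/* CPU 1 side: modify the inode, then make sure it is marked dirty. */
static void modify_and_mark_dirty(struct inode_model *inode, int value)
{
    atomic_store_explicit(&inode->data, value, memory_order_relaxed);
    /* Full barrier: the data store is ordered before the i_state test. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&inode->i_state, memory_order_relaxed) & I_DIRTY)
        return;             /* still dirty: a later sync will see the data */
    atomic_fetch_or_explicit(&inode->i_state, I_DIRTY, memory_order_relaxed);
}

/* CPU 2 side: clear I_DIRTY, then read the data that will be written. */
static int sync_inode(struct inode_model *inode)
{
    atomic_fetch_and_explicit(&inode->i_state, ~I_DIRTY, memory_order_relaxed);
    /* Full barrier: the flag clear is ordered before the data read. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&inode->data, memory_order_relaxed);
}
```

      With a full barrier on both sides, at least one CPU must observe the
      other's store: either the sync path reads the new data, or the
      modifying path sees I_DIRTY already cleared and sets it again. The
      lost-update interleaving in the diagram requires both to miss, which
      the paired barriers rule out.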
      50d49a05