1. 25 May, 2003 40 commits
    • [PATCH] Documentation for disk iostats · 0af6874c
      Andrew Morton authored
      From: Rick Lindsley <ricklind@us.ibm.com>
      
      Here is a file to add to the Documentation/ directory which describes the
      disk statistics fields.
      0af6874c
    • [PATCH] misc fixes · 4a81c9aa
      Andrew Morton authored
      - Add an explanation for clearing the focus bit on P4 (zwane)
      
      - __d_path kerneldoc fix (John Levon)
      
      - generic-hdlc documentation fix (Krzysztof Halasa <khc@pm.waw.pl>)
      
      - cmdline_read_proc cleanup (Oleg Drokin)
      
      - remove a couple of unused vars from drivers/ide/pci/hpt366.c
      
      - sound/core/sgbuf.c needs mm.h at least on alpha, for mem_map and other
        page stuff.  (Ivan Kokshaysky <ink@jurassic.park.msu.ru>)
      
      - Don't use "u32 long" in cs46xx.c (Kevin Puetz <puetzk@puetzk.org>)
      
      - fs/nfs/nfs4xdr.c warning fix: all the `goto out;' statements are
        commented away, so comment away the label too.
      
      - net/ipv6/af_inet6.c: remove unused var
      
      - drivers/media/video/bttv-cards.c: jiffies are unsigned long
      
      - drivers/media/video/saa7134/saa7134-cards.c: unused var
      
      - Fix Documentation/Changes comment wrt sparc compiler version
      
      - drivers/pnp/quirks.c needs slab.h for kfree().  (Daniele Bellucci
        <bellucda@tiscali.it>)
      4a81c9aa
    • [PATCH] extend-check_valid_hugepage_range.patch · 76e5699d
      Andrew Morton authored
      From: David Gibson <david@gibson.dropbear.id.au>
      
      
      Renames check_valid_hugepage_range() to is_hugepage_only_range(), which makes
      more sense.
      76e5699d
    • [PATCH] add notify_count for de_thread · 73accc3d
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      de_thread is called by exec to kill all threads in the thread group except
      the threads required for exec.
      
      The waiting is implemented by waiting for a wakeup from __exit_signal: if
      the reference count is less than or equal to 2, the waiter is woken up.  If
      exec is called by a non-leader thread, then two threads are required for
      exec.
      
      But if a thread group leader calls exec, then only one thread is required
      for exec.  Thus the hardcoded "2" leads to a superfluous wakeup.  The patch
      fixes that by adding a "notify_count" field to the signal structure.
      73accc3d
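A minimal sketch of the threshold logic described above, not the kernel code. The struct and function names (`signal_sketch`, `exec_notify_count`, `should_wake_exec_waiter`) are hypothetical; only the `notify_count` field name and the thresholds 1 and 2 come from the commit message, and the `<=` comparison follows its wording.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of the signal structure. */
struct signal_sketch {
    int count;         /* references still held by live threads */
    int notify_count;  /* wake the exec waiter at this count; 0 = nobody waiting */
};

/* Threads that must survive exec: the leader alone (1), or leader plus caller (2). */
static int exec_notify_count(bool caller_is_leader)
{
    return caller_is_leader ? 1 : 2;
}

/* Checked on the exit path after a thread has dropped its reference. */
static bool should_wake_exec_waiter(const struct signal_sketch *sig)
{
    return sig->notify_count != 0 && sig->count <= sig->notify_count;
}
```

With a hardcoded threshold of 2, a leader calling exec would be woken while one extra thread is still alive; the per-exec threshold avoids that early wakeup.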
    • [PATCH] net/sunrpc/sunrpc_syms.c typo fix · 9ee208ea
      Andrew Morton authored
      From: Frank Cusack <fcusack@fcusack.com>
      
      net/sunrpc/sunrpc_syms.c typo fix
      9ee208ea
    • [PATCH] overcommit root margin · cf50f395
      Andrew Morton authored
      From: Dave Hansen <haveblue@us.ibm.com>
      
      This patch makes vm_enough_memory() more likely to return failure when
      overcommit_memory==0 and !CAP_SYS_ADMIN.  I'm not sure it's worth having
      another tunable just for this.
      
      I also reworked the documentation a bit.  It should be a lot clearer to
      read now.
      cf50f395
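The idea of a root margin can be sketched as plain arithmetic. This is an illustration under stated assumptions, not the patch itself: the commit message does not give the size of the margin, so `MARGIN_DIVISOR` (1/32 of the estimate) and the function name `enough_memory_sketch` are hypothetical.

```c
#include <stdbool.h>

/* Assumed margin: hold back 1/32 of the estimate for privileged callers. */
#define MARGIN_DIVISOR 32

/*
 * Heuristic check for overcommit_memory == 0, reduced to arithmetic.
 * free_pages is whatever estimate of free plus reclaimable pages the
 * caller has; is_admin models holding CAP_SYS_ADMIN.
 */
static bool enough_memory_sketch(long request_pages, long free_pages, bool is_admin)
{
    if (!is_admin)
        free_pages -= free_pages / MARGIN_DIVISOR;  /* reserve a slice for root */
    return free_pages > request_pages;
}
```

A request that just fits for root can therefore fail for an unprivileged caller, which leaves root some headroom to recover a nearly exhausted system.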
    • [PATCH] devpts xattr handler for security labels · 4a3fbc84
      Andrew Morton authored
      From: Stephen Smalley <sds@epoch.ncsc.mil>
      
      This patch against 2.5.69-bk adds an xattr handler for security labels
      to devpts and corresponding hooks to the LSM API to support conversion
      between xattr values and the security labels stored in the inode
      security field by the security module.
      
      This allows userspace to get and set the security labels on devpts
      nodes, e.g.  so that sshd can set the security label for the pty using
      setxattr, just as sshd already sets the ownership using chown.
      
      SELinux uses this support to protect the pty in accordance with the user
      process' security label.  The changes to the LSM API are general and
      should be reusable by xattr handlers in other pseudo filesystems to
      support similar security labeling.  The xattr handler for devpts
      includes the same generic framework as in ext[23], so handlers for other
      kinds of attributes can be added easily in the future.
      4a3fbc84
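From userspace, setting a label on a pty node would look roughly like the sketch below, which uses the standard Linux `setxattr(2)` call. The helper name `set_pty_label` is hypothetical, and the attribute name `security.selinux` is an assumption based on the name SELinux uses for its labels; the commit message only says "security labels".

```c
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Set a security label on a pty node through the extended-attribute
 * interface.  Returns 0 on success, -1 with errno set on failure.
 */
static int set_pty_label(const char *pty_path, const char *label)
{
    /* The trailing NUL is stored as part of the attribute value. */
    return setxattr(pty_path, "security.selinux", label, strlen(label) + 1, 0);
}
```

A login daemon could call this right after its existing chown() of the pty, passing a label derived from the user's session.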
    • [PATCH] CONFIG_EPOLL · fb39f360
      Andrew Morton authored
      From: Christopher Hoover <ch@murgatroid.com>
      
      Here's a patch to drop some more text/data/bss out of 2.5.  This time
      the ``victim'' is eventpollfs (epoll).
      fb39f360
    • [PATCH] CONFIG_FUTEX · e8c0de6e
      Andrew Morton authored
      From: Christopher Hoover <ch@murgatroid.com>
      
      Not everyone needs futex support, so it should be optional.  This is needed
      for small platforms.
      e8c0de6e
    • [PATCH] /proc/pid inode security labels · 20378c29
      Andrew Morton authored
      From: Stephen Smalley <sds@epoch.ncsc.mil>
      
      This patch against 2.5.69-bk adds a hook to proc_pid_make_inode to allow
      security modules to set the security attributes on /proc/pid inodes based on
      the security attributes of the associated task.  This is required by SELinux
      in order to control access to the process state accessible via /proc/pid
      inodes in accordance with the task's security label.
      
      An alternative approach that was considered was to implement an xattr handler
      for /proc/pid inodes.  That approach would still require a hook call from the
      xattr handler to the security module to obtain an xattr value based on the
      task security attributes, so it would add a further level of
      indirection/translation.  The only benefit of implementing an xattr handler
      for the /proc/pid inodes would be that the /proc/pid inode security labels
      could then be exported to userspace.  However, the /proc/pid inode security
      labels are only used internally by the security module for access control
      purposes, and userspace access to the full range of process attributes is
      already provided via the /proc/pid/attr interface.  Consequently, a simple
      hook in proc_pid_make_inode seemed preferable.
      20378c29
    • [PATCH] Process Attribute API for Security Modules (fixlet) · 09d35c2a
      Andrew Morton authored
      From: Stephen Smalley <sds@epoch.ncsc.mil>
      
      This patch, relative to the /proc/pid/attr patch against 2.5.69, fixes the
      mode values of the /proc/pid/attr nodes to avoid interference by the normal
      Linux access checks for these nodes (and also fixes the /proc/pid/attr/prev
      mode to reflect its read-only nature).
      
      Otherwise, when the dumpable flag is cleared by a set[ug]id or unreadable
      executable, a process will lose the ability to set its own attributes via
      writes to /proc/pid/attr due to a DAC failure (/proc/pid inodes are
      assigned the root uid/gid if the task is not dumpable, and the original
      mode only permitted the owner to write).
      
      The security module should implement appropriate permission checking in its
      [gs]etprocattr hook functions.  In the case of SELinux, the setprocattr
      hook function only allows a process to write to its own /proc/pid/attr
      nodes as well as imposing other policy-based restrictions, and the
      getprocattr hook function performs a permission check between the security
      labels of the current process and target process to determine whether the
      operation is permitted.
      09d35c2a
    • [PATCH] Process Attribute API for Security Modules · ea7870c8
      Andrew Morton authored
      From: Stephen Smalley <sds@epoch.ncsc.mil>
      
      This updated patch against 2.5.69 merges the readdir and lookup routines
      for proc_base and proc_attr, fixes the copy_to_user call in proc_attr_read
      and proc_info_read, moves the new data and code within CONFIG_SECURITY, and
      uses ARRAY_SIZE, per the comments from Al Viro and Andrew Morton.  As
      before, this patch implements a process attribute API for security modules
      via a set of nodes in a /proc/pid/attr directory.  Credit for the idea of
      implementing this API via /proc/pid/attr nodes goes to Al Viro.  Jan Harkes
      provided a nice cleanup of the implementation to reduce the code bloat.
      ea7870c8
    • [PATCH] mark shrinkable slabs as being reclaimable · 6f333c22
      Andrew Morton authored
      All slabs which can be reclaimed via VM pressure are marked as being
      shrinkable, so the core slab code will keep count of their pages.
      
      Except for the one in XFS.  It has strange wrapper stuff.
      6f333c22
    • [PATCH] slab: account for reclaimable caches · 8f542f30
      Andrew Morton authored
      We have a problem at present in vm_enough_memory(): it uses smoke-n-mirrors
      to try to work out how much memory can be reclaimed from dcache and icache.
      It sometimes gets it quite wrong, especially if the slab has internal
      fragmentation.  And it often does.
      
      So here we take a new approach.  Rather than trying to work out how many
      pages are reclaimable by counting up the number of inodes and dentries, we
      change the slab allocator to keep count of how many pages are currently used
      by slabs which can be shrunk by the VM.
      
      The creator of the slab marks the slab as being reclaimable at
      kmem_cache_create()-time.  Slab keeps a global counter of pages which are
      currently in use by thus-tagged slabs.
      
      Of course, we now slightly overestimate the amount of reclaimable memory,
      because not _all_ of the icache, dcache, mbcache and quota caches are
      reclaimable.
      
      But I think it's better to be a bit permissive rather than bogusly failing
      brk() calls as we do at present.
      8f542f30
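The accounting scheme described above can be sketched as a tag set at creation time plus one global counter. This is a toy model, not the slab allocator: `cache_sketch`, `cache_grow`, `cache_shrink` and `reclaimable_pages` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical cache descriptor; only the "reclaimable" tag matters here. */
struct cache_sketch {
    const char *name;
    bool reclaimable;  /* decided once, when the cache is created */
    long pages;        /* pages currently backing this cache */
};

/* Global count of pages held by caches tagged as reclaimable. */
static long reclaimable_pages;

static void cache_grow(struct cache_sketch *c, long nr_pages)
{
    c->pages += nr_pages;
    if (c->reclaimable)
        reclaimable_pages += nr_pages;
}

static void cache_shrink(struct cache_sketch *c, long nr_pages)
{
    if (nr_pages > c->pages)
        nr_pages = c->pages;  /* cannot give back more than is held */
    c->pages -= nr_pages;
    if (c->reclaimable)
        reclaimable_pages -= nr_pages;
}
```

The counter tracks pages rather than objects, so internal fragmentation no longer skews the estimate; the price, as the message notes, is a slight overestimate because not every object in a tagged cache can actually be freed.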
    • [PATCH] Don't remove inode from hash until filesystem has · d6686d54
      Andrew Morton authored
      From: Neil Brown <neilb@cse.unsw.edu.au>
      
      When an NFS request arrives, it contains a filehandle which needs to be
      converted to a dentry.  Many filesystems use find_exported_dentry in
      fs/exportfs/expfs.c.  A key part of this, on filesystems where a 32-bit inode
      number uniquely locates a file, is export_iget, which calls iget(sb, inum).
      
      iget will either:
      
         1/ find the inode in the inode cache and return it
      
       or
      
         2/ create a new inode and call ->read_inode to load it from the
            storage device.
      
      export_iget then verifies the inode is really a good inode (->read_inode
      didn't detect any problems) and the right inode (based on the generation number
      from the file handle).
      
      For this to work reliably, it is important that whenever an inode is *not* in
      the cache, the on-device version is up-to-date.  Otherwise, when read_inode
      loads the inode it will get bad data.
      
      For a file that has not been deleted, this condition always holds: a dirty
      inode is always flushed to disc before the inode is unhashed.
      
      However for a file that is being deleted this condition doesn't (didn't)
      hold.  When iput -> iput_final -> generic_drop_inode -> generic_delete_inode
      is called we would unhash the inode before calling into the filesystem through
      ->delete_inode.
      
      So there is a small window between when generic_delete_inode unhashes the
      inode, and when ->delete_inode writes something to disc, where a call to
      ->read_inode (for export_iget) might discover what it thinks is a valid
      inode, but is really one that is in the process of being destroyed.
      
      It is this window that I want to close by moving the unhashing to the end of
      generic_delete_inode.
      d6686d54
    • [PATCH] Fix readdir error return value · 2eb4051e
      Andrew Morton authored
      From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      
      There are a couple of places in the readdir code where it forgets to set
      the returned error code to -EFAULT, leaving it at the default -EINVAL.
      
      Fix that up, and rename getdents_callback64.count to "result", which makes
      more sense.
      2eb4051e
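The class of bug fixed here can be modelled in a few lines. This is a simplified sketch, not the fs/readdir.c code: the struct, its fields and the helper names are hypothetical, and `copy_ok` merely stands in for a user-space copy succeeding or faulting.

```c
#include <errno.h>
#include <stdbool.h>

struct getdents_state_sketch {
    int error;   /* error to report if nothing was copied */
    int copied;  /* bytes successfully handed to userspace so far */
};

static void state_init(struct getdents_state_sketch *st)
{
    st->error = -EINVAL;  /* default, e.g. buffer too small for one entry */
    st->copied = 0;
}

/* Called once per directory entry. */
static int fill_one(struct getdents_state_sketch *st, int reclen, bool copy_ok)
{
    if (!copy_ok) {
        st->error = -EFAULT;  /* the assignment the buggy paths left out */
        return -EFAULT;
    }
    st->copied += reclen;
    return 0;
}

/* Final return value: bytes copied if any, otherwise the recorded error. */
static int getdents_return(const struct getdents_state_sketch *st)
{
    return st->copied ? st->copied : st->error;
}
```

Without the assignment in the fault path, a bad user pointer on the first entry would surface as -EINVAL instead of -EFAULT.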
    • [PATCH] xirc2ps_cs irq return fix · 61a6c177
      Andrew Morton authored
      From zwane
      
      We shutdown the MAC part of the card and have interrupts disabled, interrupt
      gets queued, we reenable interrupts after shutting down device, service the
      interrupt, check status and get 0xff from powered down device.
      
      No idea what he's talking about here, but apparently the irq return handling
      isn't working out.  Just return IRQ_HANDLED all the time.
      61a6c177
    • [PATCH] reiserfs: inode attributes support. · 37c90629
      Andrew Morton authored
      From: Oleg Drokin <green@namesys.com>
      
      This is a forward port of 2.4's inode attributes support for reiserfs.
      Original implementation for 2.4 was performed by Nikita Danilov.
      
      To enable this support, use the "attrs" mount option, e.g.:
      
      	mount /dev/hda1 /mount/point -t reiserfs -o attrs
      
      Also either the filesystem must have been created with a recent mkreiserfs
      or must have been modified by a recent version of reiserfsck with its
      "--clean-attributes" option.
      
      If that is not done, attributes support will not be enabled and a kernel
      message will be printed.  This is necessary because old kernels left random
      garbage in the place where these attributes now live.
      
      These attributes are fully compatible with ext2's.  You can
      manipulate them with chattr/lsattr etc.
      
      Additionally the chattr 'd' option may be used to disable tail packing on a
      specific file or a directory tree.  (The 'd' option normally means "don't
      dump".  reiserfs has overloaded it).
      37c90629
    • [PATCH] APM does unsafe conditional set_cpus_allowed · 0c85cefd
      Andrew Morton authored
      From: Zwane Mwaikambo <zwane@linuxpower.ca>
      
      kapmd does a conditional check in order to decide whether to set the task's
      cpu affinity mask.  This can change during runtime, therefore we
      unconditionally set it.  There is an early exit in set_cpus_allowed if the
      current processor is in the allowed mask anyway.
      0c85cefd
    • [PATCH] Fix dcache_lock/tasklist_lock ranking bug · 055e188d
      Andrew Morton authored
      __unhash_process acquires the dcache_lock while holding the
      tasklist_lock for writing. This can deadlock. Additionally,
      fs/proc/base.c incorrectly assumed that p->pid would be set to 0 during
      release_task.
      
      The patch fixes that by adding a new spinlock to the task structure and
      fixing all references to (!p->pid).
      
      The alternative to the new spinlock would be to hold dcache_lock around
      __unhash_process.
      
      - fs/proc/base.c assumed that p->pid is reset to 0 during exit.  This is
        not the case anymore.  I now look at the count of the pid structure for
        PIDTYPE_PID.
      
      - de_thread now tested - as broken as it was before: open handles to
        /proc/<pid> are either stale or invalid after an exec of a nptl process,
        if the exec was called from a secondary thread.
      
      - a few lock_kernels removed - that part of /proc doesn't need it.
      
      - additional instances of 'if(current->pid)' replaced with pid_alive.
      055e188d
    • [PATCH] arch/i386/kernel/mpparse.c warning fixes · 05cdeac3
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      mpc_apicid is a u8, and MAX_APICS can be 256.
      05cdeac3
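The warning presumably comes from a range check of an 8-bit value against 256, which the compiler can prove is constant. The sketch below shows why and one way to keep such a check meaningful; the function names are hypothetical and this is not the actual mpparse.c change.

```c
#include <stdbool.h>
#include <stdint.h>

/* An 8-bit field cannot represent 256: the value wraps on conversion. */
static uint8_t truncate_to_u8(unsigned int v)
{
    return (uint8_t)v;
}

/*
 * Doing the range check on a wider type keeps it meaningful for any
 * limit, including 256, and avoids an "always true" comparison.
 */
static bool apicid_in_range(unsigned int apicid, unsigned int max_apics)
{
    return apicid < max_apics;
}
```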
    • [PATCH] siocdevprivate_ioctl warning fix · 2a52198b
      Andrew Morton authored
      fs/compat.c: In function `compat_sys_ioctl':
      fs/compat.c:324: warning: implicit declaration of function `siocdevprivate_ioctl'
      2a52198b
    • [PATCH] tty_io warning fix · 396382dc
      Andrew Morton authored
      Don't assume the size of dev_t: on ppc64 it is unsigned long and this
      generates a printk warning.
      396382dc
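One common way to print a type whose width varies by architecture is to cast it to a fixed, sufficiently wide type that matches the format specifier. The userspace sketch below uses snprintf in place of printk; the helper name `format_dev` is hypothetical and the commit message does not say which cast the patch uses.

```c
#include <stdio.h>
#include <sys/types.h>

/* Format a dev_t portably: cast to unsigned long so it always matches %lu. */
static int format_dev(char *buf, size_t len, dev_t dev)
{
    return snprintf(buf, len, "dev=%lu", (unsigned long)dev);
}
```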
    • [PATCH] ppc64: more warning fixes · 4a6e2172
      Andrew Morton authored
      arch/ppc64/kernel/htab.c:105: warning: implicit declaration of function `pSeries_lpar_hpte_insert'
      arch/ppc64/kernel/htab.c:109: warning: implicit declaration of function `pSeries_hpte_insert'
      4a6e2172
    • [PATCH] ppc64: arch/ppc64/kernel/traps.c warning fixes · c5ef8de3
      Andrew Morton authored
      Fix a printk warning
      c5ef8de3
    • [PATCH] ppc64: nail warnings in arch/ppc64/kernel/setup.c · 83599e3c
      Andrew Morton authored
      two printk warnings
      83599e3c
    • [PATCH] ppc64: ioctl32 warning fix · e806a036
      Andrew Morton authored
      warning: assignment makes pointer from integer without a cast
      e806a036
    • [PATCH] ppc64: build fix · 9b2a6123
      Andrew Morton authored
      It needs sched.h for `current'.
      9b2a6123
    • [PATCH] ppc64: Unused variables in ppc64 prom.c · 48df450c
      Andrew Morton authored
      From: David Gibson <david@gibson.dropbear.id.au>
      
      This removes a bunch of unused variables in prom_init(), squashing the
      associated warnings.
      48df450c
    • [PATCH] ppc64: Squash warning in ppc64 xics.c · ea8b5b2e
      Andrew Morton authored
      From: David Gibson <david@gibson.dropbear.id.au>
      
      xics.c uses ppc64_boot_msg() without a prototype; this fixes it by including
      <asm/machdep.h>.
      ea8b5b2e
    • [PATCH] ppc64: do_signal32 warning fix · 1cb4f432
      Andrew Morton authored
      do_signal32() is used before it is defined; this prototype squashes the
      warning.
      1cb4f432
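The pattern behind this fix, and behind the other implicit-declaration fixes in this series, is a forward declaration placed ahead of the first use. A generic sketch with hypothetical function names:

```c
/* Forward declaration: gives the compiler the signature before first use. */
static int handle_signal_sketch(int signr);

static int dispatch_sketch(int signr)
{
    /* Without the prototype above, this call would be an implicit declaration. */
    return handle_signal_sketch(signr);
}

static int handle_signal_sketch(int signr)
{
    return signr > 0 ? 1 : 0;
}
```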
    • [PATCH] ppc64: Squash implicit declaration warning in ppc64 · 62c2905c
      Andrew Morton authored
      From: David Gibson <david@gibson.dropbear.id.au>
      
      Squash implicit declaration warning in ppc64 align.c
      62c2905c
    • [PATCH] ppc64: Squash warning in ppc64 addnote tool · e6670878
      Andrew Morton authored
      From: David Gibson <david@gibson.dropbear.id.au>
      
      addnote in arch/ppc64/boot (a userspace tool, not kernel code) uses exit()
      without including stdlib.h.
      e6670878
    • [PATCH] ppc64: PPC64 irq return fix · ffe8c05d
      Andrew Morton authored
      PPC64 irq return fix
      ffe8c05d
    • [PATCH] ppc64: Fix some PPC64 compile warnings · d69b7c27
      Andrew Morton authored
      Fix some warnings in the ppc64 build.
      
      Also declare a couple of AIO functions in aio.h rather than aio.c.  They are
      needed for 32-bit emulation support.
      d69b7c27
    • [PATCH] ppc64: 32/64bit emulation for aio · 2b748116
      Andrew Morton authored
      From: Anton Blanchard <anton@samba.org>
      
      PPC64 32/64-bit emulation for AIO.
      2b748116
    • Make cdev infrastructure initialize early · 276df1b2
      Linus Torvalds authored
      Very early initialization (core_initcall) needs to have the cdev
      initialization done.  So make it part of the pre-initcall sequence, the
      same way the bdev caches were done.
      276df1b2
    • Linus Torvalds's avatar
      48554ca4
    • [PATCH] support "requeueing" futexes · 7149345c
      Ingo Molnar authored
      This addresses a futex related SMP scalability problem of
      glibc. A number of regressions have been reported to the NPTL mailing list
      when going to many CPUs, for applications that use condition variables and
      the pthread_cond_broadcast() API call. Using this functionality, testcode
      shows a slowdown from 0.12 seconds runtime to over 237 seconds (!)
      runtime, on 4-CPU systems.
      
      pthread condition variables use two futex-backed mutex-alike locks: an
      internal one for the glibc CV state itself, and a user-supplied mutex
      which the API guarantees to take in certain codepaths. (Unfortunately the
      user-supplied mutex cannot be used to protect the CV state, so we've got
      to deal with two locks.)
      
      The cause of the slowdown is a 'swarm effect': if lots of threads are
      blocked on a condition variable, and pthread_cond_broadcast() is done,
      then glibc first does a FUTEX_WAKE on the cv-internal mutex, then does a
      mutex_down() on the user-supplied mutex. I.e. a swarm of threads is created
      which all race to serialize on the user-supplied mutex. The more threads
      are used, the more likely it becomes that the scheduler will balance them
      over to other CPUs - where they just schedule, try to lock the mutex, and
      go to sleep. This 'swarm effect' is purely technical, a side-effect of
      glibc's use of futexes, and the imperfect coupling of the two locks.
      
      The solution to this problem is to not wake up the swarm of threads, but
      'requeue' them from the CV-internal mutex to the user-supplied mutex. The
      attached patch adds the FUTEX_REQUEUE feature: FUTEX_REQUEUE requeues N
      threads from futex address A to futex address B.
      
      This way glibc can wake up a single thread (which will take the
      user-mutex), and can requeue the rest, with a single system-call.
      
      Ulrich Drepper has implemented FUTEX_REQUEUE support in glibc, and a
      number of people have tested it over the past couple of weeks. Here are
      the measurements done by Saurabh Desai:
      
      System: 4xPIII 700MHz
      
       ./cond-perf -r 100 -n 200:        1p       2p         4p
       Default NPTL:                 0.120s   0.211s   237.407s
       requeue NPTL:                 0.124s   0.156s     0.040s
      
       ./cond-perf -r 1000 -n 100:
       Default NPTL:                 0.276s   0.412s     0.530s
       requeue NPTL:                 0.349s   0.503s     0.550s
      
       ./pp -v -n 128 -i 1000 -S 32768:
       Default NPTL: 128 games in    1.111s   1.270s    16.894s
       requeue NPTL: 128 games in    1.111s   1.959s     2.426s
      
       ./pp -v -n 1024 -i 10 -S 32768:
       Default NPTL: 1024 games in   0.181s   0.394s     incompleted 2m+
       requeue NPTL: 1024 games in   0.166s   0.254s     0.341s
      
      The speedup with increasing number of threads is quite significant: in the
      128-thread case it's more than 8 times. In the cond-perf test, on 4 CPUs
      it's almost infinitely faster than the 'swarm of threads' catastrophe
      triggered by the old code.
      7149345c
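The wake-one, requeue-the-rest behaviour described above can be illustrated with a toy model of two wait queues. This is a userspace simulation with hypothetical names (`waitq_sketch`, `waitq_pop`, `requeue_sketch`), not the kernel implementation and not a call to the futex system call.

```c
#include <stddef.h>

#define MAX_WAITERS 64

/* Toy wait queue: thread ids in FIFO order. */
struct waitq_sketch {
    int tids[MAX_WAITERS];
    size_t len;
};

/* Remove up to n waiters from the head of q into out; returns the number removed. */
static size_t waitq_pop(struct waitq_sketch *q, size_t n, int *out)
{
    size_t taken = n < q->len ? n : q->len;

    for (size_t i = 0; i < taken; i++)
        out[i] = q->tids[i];
    for (size_t i = taken; i < q->len; i++)
        q->tids[i - taken] = q->tids[i];
    q->len -= taken;
    return taken;
}

/*
 * Wake up to nr_wake waiters on `from`, then move up to nr_requeue of the
 * remaining waiters onto `to` without waking them.  Woken ids are written
 * to `woken` (the caller provides room for nr_wake entries); the return
 * value is the number of threads woken.
 */
static size_t requeue_sketch(struct waitq_sketch *from, struct waitq_sketch *to,
                             size_t nr_wake, size_t nr_requeue, int *woken)
{
    int moved[MAX_WAITERS];
    size_t nwoken = waitq_pop(from, nr_wake, woken);
    size_t room = MAX_WAITERS - to->len;
    size_t nmove = waitq_pop(from, nr_requeue < room ? nr_requeue : room, moved);

    for (size_t i = 0; i < nmove; i++)
        to->tids[to->len++] = moved[i];
    return nwoken;
}
```

In the broadcast case the caller would wake one waiter and requeue the rest onto the user-supplied mutex, so only one thread runs and contends for that mutex instead of the whole swarm.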
    • [PATCH] i_cdev/i_cindex · 9bda5f68
      Alexander Viro authored
      new fields in struct inode - i_cdev and i_cindex.  When we do open() on
      a character device we cache result of cdev lookup in inode and put the
      inode on a cyclic list anchored in cdev.  If we already have that done,
      we don't bother with any lookups.  When inode disappears it's removed
      from the list.  When cdev gets unregistered we remove all cached
      references to it (and remove such inodes from the list).  cdev is held
      until final fput() now.
      9bda5f68
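The caching idea can be sketched as a pointer in the inode that is filled on first open and cleared on unregister. This toy model omits the cyclic list and reference counting the message describes; apart from the `i_cdev` field name, every identifier below is hypothetical.

```c
#include <stddef.h>

struct cdev_sketch {
    int major;
};

struct inode_sketch {
    int rdev_major;
    struct cdev_sketch *i_cdev;  /* cached lookup result; NULL = not cached */
};

static int lookups;  /* counts how often the slow path ran */

/* Slow path: linear search standing in for the real cdev lookup. */
static struct cdev_sketch *slow_lookup(struct cdev_sketch *table, size_t n, int major)
{
    lookups++;
    for (size_t i = 0; i < n; i++)
        if (table[i].major == major)
            return &table[i];
    return NULL;
}

/* open(): use the cached pointer if present, otherwise look up and cache it. */
static struct cdev_sketch *open_chrdev(struct inode_sketch *inode,
                                       struct cdev_sketch *table, size_t n)
{
    if (!inode->i_cdev)
        inode->i_cdev = slow_lookup(table, n, inode->rdev_major);
    return inode->i_cdev;
}

/* On unregister, drop the cached pointer so the next open looks up again. */
static void forget_cdev(struct inode_sketch *inode)
{
    inode->i_cdev = NULL;
}
```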