1. 21 Sep, 2011 1 commit
  2. 14 Sep, 2011 1 commit
  3. 13 Sep, 2011 1 commit
  4. 08 Sep, 2011 9 commits
    • posix-cpu-timers: Cure SMP accounting oddities · e8abccb7
      Peter Zijlstra authored
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Cc: stable@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e8abccb7
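The reproducer above computes each diff from `tv_nsec` alone, which is only valid while both samples fall within the same second. A minimal sketch of a full-width difference and of the invariant David describes follows; `timespec_diff_ns` and `process_covers_thread` are hypothetical helper names for illustration, not part of the commit.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: nanoseconds elapsed between two clock samples,
 * using both tv_sec and tv_nsec so a second boundary is handled. */
static int64_t timespec_diff_ns(const struct timespec *before,
                                const struct timespec *after)
{
	return (int64_t)(after->tv_sec - before->tv_sec) * 1000000000LL
	       + (int64_t)(after->tv_nsec - before->tv_nsec);
}

/* Hypothetical check of the reported invariant: the process clock
 * difference should never be smaller than that of one of its threads. */
static int process_covers_thread(int64_t process_diff_ns,
                                 int64_t thread_diff_ns)
{
	return process_diff_ns >= thread_diff_ns;
}
```

With the numbers from the report, `process_covers_thread(497402404, 498234739)` is false, which is the oddity the patch cures.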
    • s390: Use direct ktime path for s390 clockevent device · 4f37a68c
      Martin Schwidefsky authored
      The clock comparator on s390 uses the same format as the TOD clock.
      If the value in the clock comparator is smaller than the current TOD
      value an interrupt is pending. Use the CLOCK_EVT_FEAT_KTIME feature
      to get the unmodified ktime of the next clockevent expiration and
      use it to program the clock comparator without querying the TOD clock.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133143.153017933@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      4f37a68c
    • clockevents: Add direct ktime programming function · 65516f8a
      Martin Schwidefsky authored
      There is at least one architecture (s390) with a sane clockevent device
      that can be programmed with the equivalent of a ktime. No need to create
      a delta against the current time, the ktime can be used directly.
      
      A new clock device function 'set_next_ktime' is introduced that is called
      with the unmodified ktime for the timer if the clock event device has the 
      CLOCK_EVT_FEAT_KTIME bit set.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.815350967@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      65516f8a
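The dispatch described in this commit can be sketched in user space as follows. This is a mock, not the kernel code: `struct mock_clock_dev`, `MOCK_FEAT_KTIME` (a stand-in for CLOCK_EVT_FEAT_KTIME) and `mock_program_event` are hypothetical names, and the `-1` return for an already expired event is an assumption made for the illustration.

```c
#include <stdint.h>

#define MOCK_FEAT_KTIME 0x1u	/* stand-in for CLOCK_EVT_FEAT_KTIME */

/* Hypothetical device model: records what the "hardware" was given. */
struct mock_clock_dev {
	unsigned int features;
	int64_t last_programmed;	/* absolute time or delta, in ns */
	int programmed_absolute;	/* 1 if last_programmed is absolute */
};

/* Program the next event. Devices with the ktime feature receive the
 * unmodified expiry time; all others receive a delta against "now". */
static int mock_program_event(struct mock_clock_dev *dev,
                              int64_t expires_ns, int64_t now_ns)
{
	int64_t delta;

	if (dev->features & MOCK_FEAT_KTIME) {
		dev->last_programmed = expires_ns;
		dev->programmed_absolute = 1;
		return 0;
	}

	delta = expires_ns - now_ns;
	if (delta <= 0)
		return -1;	/* assumed error value: event already expired */

	dev->last_programmed = delta;
	dev->programmed_absolute = 0;
	return 0;
}
```

The point of the direct path is that the ktime-capable device never needs to read the current time at all.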
    • clockevents: Make minimum delay adjustments configurable · d1748302
      Martin Schwidefsky authored
      The automatic increase of the min_delta_ns of a clockevents device
      should be done in the clockevents code as the minimum delay is an
      attribute of the clockevents device.
      
      In addition, not all architectures want the automatic adjustment. On a
      massively virtualized system it can happen that the programming of a
      clock event fails several times in a row because the virtual cpu has
      been rescheduled quickly enough. In that case the minimum delay will
      erroneously be increased with no way back. The new config symbol
      GENERIC_CLOCKEVENTS_MIN_ADJUST is used to enable the automatic
      adjustment. The config option is selected only for x86.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.494157493@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      d1748302
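A rough user-space sketch of a one-way minimum-delay adjustment that can be switched off, as the commit describes. The growth policy (50% per failure), the 1 ms cap and the names `MOCK_MIN_DELTA_LIMIT_NS` and `mock_increase_min_delta` are assumptions for illustration only; the `adjust_enabled` flag stands in for the GENERIC_CLOCKEVENTS_MIN_ADJUST config symbol.

```c
#include <stdint.h>

#define MOCK_MIN_DELTA_LIMIT_NS 1000000ULL	/* assumed cap: 1 ms */

/* Called after programming an event failed. When adjustment is enabled
 * the minimum delay only ever grows, which is the "no way back" effect
 * the commit message warns about for virtualized systems. */
static int mock_increase_min_delta(uint64_t *min_delta_ns, int adjust_enabled)
{
	if (!adjust_enabled)
		return -1;	/* adjustment compiled out: report failure */

	if (*min_delta_ns >= MOCK_MIN_DELTA_LIMIT_NS)
		return -1;	/* already at the cap, give up */

	*min_delta_ns += *min_delta_ns / 2;	/* assumed policy: grow by 50% */
	if (*min_delta_ns > MOCK_MIN_DELTA_LIMIT_NS)
		*min_delta_ns = MOCK_MIN_DELTA_LIMIT_NS;

	return 0;
}
```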
    • nohz: Remove "Switched to NOHz mode" debugging messages · 29c158e8
      Heiko Carstens authored
      When performing cpu hotplug tests the kernel printk log buffer gets flooded
      with pointless "Switched to NOHz mode..." messages. When analyzing a dump
      afterwards, these messages may have pushed more interesting content out of
      the buffer.
      Assuming that switching to NOHz mode simply works, just remove the printk.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Link: http://lkml.kernel.org/r/20110823112046.GB2540@osiris.boeblingen.de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      29c158e8
    • proc: Consider NO_HZ when printing idle and iowait times · a25cac51
      Michal Hocko authored
      The show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
      statistics when printing information about idle and iowait times.
      This is OK if we are not using a tickless kernel (CONFIG_NO_HZ) because
      the counters are updated periodically.
      With NO_HZ things get trickier because we do not do idle/iowait
      accounting while we are tickless, so the values might be outdated.
      Users of /proc/stat will see unchanged idle/iowait values, which are
      then interpreted as 0% idle/iowait time. From the user space POV this
      is unexpected behavior and a change of the interface.

      Let's fix this by using get_cpu_{idle,iowait}_time_us, which account for
      the total idle/iowait time since boot and do not rely on sampling or any
      other periodic activity. Fall back to the previous behavior if NO_HZ is
      disabled or not configured.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a25cac51
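The fallback described here can be sketched as a small selection function. This is a user-space illustration only: `MOCK_NO_NOHZ_DATA` is an assumed sentinel meaning "the NO_HZ accounting is not available", and `mock_idle_time_usecs` is a hypothetical name, not the function used in the patch.

```c
#include <stdint.h>

/* Assumed sentinel: the NO_HZ-aware source has no data for this CPU
 * (NO_HZ disabled or not configured). */
#define MOCK_NO_NOHZ_DATA UINT64_MAX

/* Prefer the NO_HZ-aware idle time, which stays correct while the CPU
 * is tickless; otherwise fall back to the periodically sampled value. */
static uint64_t mock_idle_time_usecs(uint64_t nohz_idle_us,
                                     uint64_t sampled_idle_us)
{
	if (nohz_idle_us == MOCK_NO_NOHZ_DATA)
		return sampled_idle_us;
	return nohz_idle_us;
}
```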
    • nohz: Make idle/iowait counter update conditional · 09a1d34f
      Michal Hocko authored
      get_cpu_{idle,iowait}_time_us update idle/iowait counters
      unconditionally if the given CPU is in the idle loop.
      
      This doesn't work well outside of CPU governors which are singletons
      so nobody (except for IRQ) can race with them.
      
      We will need to use both functions from /proc/stat handler to properly
      handle nohz idle/iowait times.
      
      Make the update depend on a non-NULL last_update_time argument.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/11f23179472635ce52e78921d47a20216b872f23.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      09a1d34f
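A minimal sketch of making a counter update depend on an optional out-parameter, in the spirit of this commit. `struct mock_tick_state` and `mock_get_cpu_idle_time_us` are hypothetical user-space stand-ins; the real function takes a CPU number and reads the clock itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU state. */
struct mock_tick_state {
	int idle_active;		/* CPU is currently in the idle loop */
	uint64_t idle_entry_us;		/* when the current idle period began */
	uint64_t idle_sleeptime_us;	/* accumulated idle time */
};

/* Only callers that pass a non-NULL last_update_time (the singleton
 * governors in the commit's wording) fold the running idle period into
 * the counter. Other callers get the stored value and modify nothing. */
static uint64_t mock_get_cpu_idle_time_us(struct mock_tick_state *ts,
                                          uint64_t now_us,
                                          uint64_t *last_update_time)
{
	if (last_update_time != NULL) {
		if (ts->idle_active) {
			ts->idle_sleeptime_us += now_us - ts->idle_entry_us;
			ts->idle_entry_us = now_us;
		}
		*last_update_time = now_us;
	}
	return ts->idle_sleeptime_us;
}
```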
    • nohz: Fix update_ts_time_stat idle accounting · 6beea0cd
      Michal Hocko authored
      update_ts_time_stat currently updates idle time even if we are in
      the iowait loop at the moment. The only real users of the idle counter
      (via get_cpu_idle_time_us) are CPU governors and they expect to get
      cumulative time for both idle and iowait times.
      The value (idle_sleeptime) is also printed to userspace by print_cpu
      but it prints both idle and iowait times so the idle part is misleading.
      
      Let's clean this up and fix update_ts_time_stat to account both counters
      properly and update consumers of idle to consider iowait time as well.
      If we do this we might use get_cpu_{idle,iowait}_time_us from other
      contexts as well and we will get expected values.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/e9c909c221a8da402c4da07e4cd968c3218f8eb1.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      6beea0cd
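The split accounting can be sketched as follows: an elapsed idle period is charged to exactly one of the two counters, and a consumer that wants the old cumulative value adds them up. This is a user-space mock; `struct mock_tick_sched`, `mock_update_ts_time_stat` and `mock_governor_idle_us` are hypothetical names, and passing the iowait task count as a parameter is an assumption made to keep the example self-contained.

```c
#include <stdint.h>

/* Hypothetical per-CPU state with separate idle and iowait counters. */
struct mock_tick_sched {
	int idle_active;
	uint64_t idle_entry_us;
	uint64_t idle_sleeptime_us;
	uint64_t iowait_sleeptime_us;
};

/* Charge the time since idle entry to iowait if tasks are waiting on
 * I/O, otherwise to plain idle. Never to both. */
static void mock_update_ts_time_stat(struct mock_tick_sched *ts,
                                     uint64_t now_us, unsigned int nr_iowait)
{
	uint64_t delta;

	if (!ts->idle_active)
		return;

	delta = now_us - ts->idle_entry_us;
	if (nr_iowait > 0)
		ts->iowait_sleeptime_us += delta;
	else
		ts->idle_sleeptime_us += delta;
	ts->idle_entry_us = now_us;
}

/* Consumers that expect cumulative idle time sum both counters. */
static uint64_t mock_governor_idle_us(const struct mock_tick_sched *ts)
{
	return ts->idle_sleeptime_us + ts->iowait_sleeptime_us;
}
```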
    • cputime: Clean up cputime_to_usecs and usecs_to_cputime macros · ef0e0f5e
      Michal Hocko authored
      Get rid of the semicolon so that those expressions can also be used
      somewhere other than just in an assignment.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/7565417ce30d7e6b1ddc169843af0777dbf66e75.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ef0e0f5e
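Why the trailing semicolon matters can be shown with a pair of stand-in macros. The `mock_` names and the conversion factor are assumptions for illustration; only the shape of the problem is taken from the commit.

```c
#define MOCK_USECS_PER_TICK 10000UL	/* assumed factor (HZ = 100) */

/* With a trailing semicolon, as in
 *   #define mock_cputime_to_usecs(ct) ((ct) * MOCK_USECS_PER_TICK);
 * the macro only works as the right-hand side of a plain assignment.
 * Without it, the macro is an ordinary expression. */
#define mock_cputime_to_usecs(ct) ((ct) * MOCK_USECS_PER_TICK)
#define mock_usecs_to_cputime(us) ((us) / MOCK_USECS_PER_TICK)

/* Uses the macro twice inside one larger expression, which would not
 * compile if the expansion ended in a semicolon. */
static unsigned long mock_total_usecs(unsigned long user_ct,
                                      unsigned long sys_ct)
{
	return mock_cputime_to_usecs(user_ct) + mock_cputime_to_usecs(sys_ct);
}
```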
  5. 10 Aug, 2011 11 commits
  6. 08 Aug, 2011 1 commit
  7. 07 Aug, 2011 16 commits
    • Linus Torvalds's avatar
      9e233113
    • sh: Fix boot crash related to SCI · fc97114b
      Rafael J. Wysocki authored
      Commit d006199e72a9 ("serial: sh-sci: Regtype probing doesn't need to be
      fatal.") made sci_init_single() return when sci_probe_regmap() succeeds,
      although it should return when sci_probe_regmap() fails.  This causes
      systems using the serial sh-sci driver to crash during boot.
      
      Fix the problem by using the right return condition.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc97114b
    • arm: remove stale export of 'sha_transform' · f23c126b
      Linus Torvalds authored
      The generic library code already exports the generic function, this was
      left-over from the ARM-specific version that just got removed.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f23c126b
    • arm: remove "optimized" SHA1 routines · 4d448714
      Linus Torvalds authored
      Since commit 1eb19a12 ("lib/sha1: use the git implementation of
      SHA-1"), the ARM SHA1 routines no longer work.  The reason? They
      depended on the larger 320-byte workspace, and now the sha1 workspace is
      just 16 words (64 bytes).  So the assembly version would overwrite the
      stack randomly.
      
      The optimized asm version is also probably slower than the new improved
      C version, so there's no reason to keep it around.  At least that was
      the case in git, where what appears to be the same assembly language
      version was removed two years ago because the optimized C BLK_SHA1 code
      was faster.
      Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Cc: Nicolas Pitre <nico@fluxnic.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d448714
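The size mismatch is simple arithmetic, sketched below. The word counts follow from the byte sizes in the message, assuming 32-bit words (320 / 4 = 80); the macro and function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define OLD_WORKSPACE_WORDS 80	/* 320 bytes, derived assuming 4-byte words */
#define NEW_WORKSPACE_WORDS 16	/* 64 bytes, as stated in the commit */

static size_t workspace_bytes(size_t words)
{
	return words * sizeof(uint32_t);
}

/* How far code written for the old workspace would write past the end
 * of a buffer sized for the new one. */
static size_t workspace_overrun_bytes(void)
{
	return workspace_bytes(OLD_WORKSPACE_WORDS)
	       - workspace_bytes(NEW_WORKSPACE_WORDS);
}
```

Under these assumptions the old assembly could touch up to 256 bytes beyond the caller's 64-byte buffer.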
    • fix rcu annotations noise in cred.h · 32955148
      Al Viro authored
      task->cred is declared as __rcu, and access to other tasks' ->cred is,
      indeed, protected.  Access to current->cred does not need rcu_dereference()
      at all, since only the task itself can change its ->cred.  sparse, of
      course, has no way of knowing that...
      
      Add a force-cast in current_cred() and make current_fsuid() et al. use it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32955148
    • vfs: rename 'do_follow_link' to 'should_follow_link' · 7813b94a
      Linus Torvalds authored
      Al points out that the do_follow_link() helper function really is
      misnamed - it's about whether we should try to follow a symlink or not,
      not about actually doing the following.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7813b94a
    • Fix POSIX ACL permission check · 206b1d09
      Ari Savolainen authored
      After commit 3567866b: "RCUify freeing acls, let check_acl() go ahead in
      RCU mode if acl is cached" posix_acl_permission is being called with an
      unsupported flag and the permission check fails. This patch fixes the issue.
      Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      206b1d09
    • Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd · c2f340a6
      Linus Torvalds authored
      * 'for-linus' of git://git.open-osd.org/linux-open-osd:
        ore: Make ore its own module
        exofs: Rename raid engine from exofs/ios.c => ore
        exofs: ios: Move to a per inode components & device-table
        exofs: Move exofs specific osd operations out of ios.c
        exofs: Add offset/length to exofs_get_io_state
        exofs: Fix truncate for the raid-groups case
        exofs: Small cleanup of exofs_fill_super
        exofs: BUG: Avoid sbi realloc
        exofs: Remove pnfs-osd private definitions
        nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
      c2f340a6
    • vfs: optimize inode cache access patterns · 3ddcd056
      Linus Torvalds authored
      The inode structure layout is largely random, and some of the vfs paths
      really do care.  The path lookup in particular is already quite D$
      intensive, and profiles show that accessing the 'inode->i_op->xyz'
      fields is quite costly.
      
      We already optimized the dcache to not unnecessarily load the d_op
      structure for members that are often NULL using the DCACHE_OP_xyz bits
      in dentry->d_flags, and this does something very similar for the inode
      ops that are used during pathname lookup.
      
      It also re-orders the fields so that the fields accessed by 'stat' are
      together at the beginning of the inode structure, and roughly in the
      order accessed.
      
      The effect of this seems to be in the 1-2% range for an empty kernel
      "make -j" run (which is fairly kernel-intensive, mostly in filename
      lookup), so it's visible.  The numbers are fairly noisy, though, and
      likely depend a lot on exact microarchitecture.  So there's more tuning
      to be done.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ddcd056
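The idea of caching "this operation is absent" in a flag word inside the object itself, so the hot path can skip loading the separate ops structure, might look roughly like this. All names here (`mock_inode`, `MOCK_IOP_NOFOLLOW`, and so on) are hypothetical and the layout is not the kernel's.

```c
#include <stddef.h>

struct mock_inode_ops {
	int (*follow_link)(void);
};

struct mock_inode {
	unsigned int opflags;		/* lives in the inode's own cache line */
	const struct mock_inode_ops *ops;
};

#define MOCK_IOP_NOFOLLOW 0x1u	/* hypothetical: ops->follow_link is NULL */

/* Example operation, used only to have a non-NULL pointer. */
static int mock_follow(void)
{
	return 0;
}

/* Record once that the operation is missing. */
static void mock_inode_init_flags(struct mock_inode *inode)
{
	if (inode->ops->follow_link == NULL)
		inode->opflags |= MOCK_IOP_NOFOLLOW;
}

/* Hot path: the common "no such op" answer needs only inode->opflags
 * and never dereferences inode->ops. */
static int mock_has_follow_link(const struct mock_inode *inode)
{
	if (inode->opflags & MOCK_IOP_NOFOLLOW)
		return 0;
	return inode->ops->follow_link != NULL;
}
```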
    • vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e
      Linus Torvalds authored
      Gcc tends to generate better code with small integers, including the
      DCACHE_xyz flag tests - so move the common ones to be first in the list.
      Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
      DCACHE_AUTOFS_PENDING values, their users no longer exist in the source
      tree.

      And add an "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
      common case to be a nice straight-line fall-through.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      830c0f0e
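A small user-space sketch of the branch hint, assuming GCC or Clang for `__builtin_expect`. The flag value and the `mock_` names are invented for illustration; only the pattern of marking the custom-compare path as unlikely comes from the commit.

```c
#include <string.h>
#include <strings.h>

#define mock_unlikely(x) __builtin_expect(!!(x), 0)
#define MOCK_DCACHE_OP_COMPARE 0x0002u	/* hypothetical bit value */

/* Example custom comparison: case-insensitive, returns 0 on a match. */
static int mock_compare_nocase(const char *a, const char *b)
{
	return strcasecmp(a, b);
}

/* The rare custom-compare case is hinted as unlikely, so the common
 * case can be laid out as the straight-line fall-through. */
static int mock_names_equal(unsigned int flags, const char *a, const char *b,
                            int (*custom)(const char *, const char *))
{
	if (mock_unlikely(flags & MOCK_DCACHE_OP_COMPARE))
		return custom(a, b) == 0;
	return strcmp(a, b) == 0;
}
```

The hint changes code layout only, not behavior; both paths still return the same results they would without it.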
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 7cd4767e
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: Compute protocol sequence numbers and fragment IDs using MD5.
        crypto: Move md5_transform to lib/md5.c
      7cd4767e
    • ore: Make ore its own module · cf283ade
      Boaz Harrosh authored
      Export everything from ore that needs exporting. Change Kbuild and Kconfig
      to build ore.ko as an independent module. Import ore from exofs.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      cf283ade
    • exofs: Rename raid engine from exofs/ios.c => ore · 8ff660ab
      Boaz Harrosh authored
      ORE stands for "Objects Raid Engine"
      
      This patch is a mechanical rename of everything that was in ios.c
      and its API declaration to an ore.c and an osd_ore.h header. The ore
      engine will later be used by the pnfs objects layout driver.
      
      * File ios.c => ore.c
      
      * Declaration of types and API are moved from exofs.h to a new
        osd_ore.h
      
      * All used types are prefixed by ore_ from their exofs_ name.
      
      * Shift includes from exofs.h to osd_ore.h so osd_ore.h is
        independent, include it from exofs.h.
      
      Other than a pure rename there are no other changes. The next patch
      will move the ore into its own module and will export the API
      to be used by exofs and later the layout driver.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      8ff660ab
    • exofs: ios: Move to a per inode components & device-table · 9e9db456
      Boaz Harrosh authored
      Exofs raid engine was saving on memory space by having a single layout-info,
      single pid, and a single device-table, global to the filesystem. Then passing
      a credential and object_id info at the io_state level, private for each
      inode. It would also devise this contraption of rotating the device table
      view for each inode->ino to spread out the device usage.
      
      This is not compatible with the pnfs-objects standard, which demands that
      each inode can have its own layout-info and device-table, and each object
      component its own pid, oid and creds.
      
      So: Bring exofs raid engine to be usable for generic pnfs-objects use by:
      
      * Define an exofs_comp structure that holds obj_id and credential info.
      
      * Break up exofs_layout struct to an exofs_components structure that holds a
        possible array of exofs_comp and the array of devices + the size of the
        arrays.
      
      * Add a "comps" parameter to get_io_state() that specifies the ids creds
        and device array to use for each IO.
      
        This makes it possible to keep the layout global, but the device-table
        view, creds and IDs at the inode level. It only adds two 64-bit members
        to each inode, since some of these members already existed in another form.

      * The ios raid engine now accesses layout-info and comps-info through the
        passed pointers. Everything is pre-prepared by the caller for generic
        access of these structures and arrays.
      
      At the exofs Level:
      
      * Super block holds an exofs_components struct that holds the device
        array, previously in layout. The devices there are in device-table
        order. The device-array is twice as big and repeats the device-table
        twice, so now each inode's device array can point to a random device
        and have a round-robin view of the table, making it compatible with
        previous exofs versions.

      * Each inode has an exofs_components struct that is initialized at
        load time, with its own view of the device table IDs and creds.
        When doing IO this gets passed to the io_state together with the
        layout.
      
      While performing this change, bugs were found where credentials with the
      wrong IDs were used to access the different SB objects (super.c), as well
      as some dead code. They were never noticed because the target we use does
      not check the credentials.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      9e9db456
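The doubled device table can be illustrated with plain integers standing in for devices. Because the table is stored twice back to back, a pointer to any of the first n entries yields n contiguous entries that cover every device once, in rotated order. The helper names are hypothetical, and choosing the start offset as `ino % n` is an assumption made for this sketch.

```c
#include <stddef.h>

/* Fill `doubled` (capacity 2 * n) with the device table repeated twice. */
static void mock_build_doubled_table(const int *table, size_t n, int *doubled)
{
	size_t i;

	for (i = 0; i < 2 * n; i++)
		doubled[i] = table[i % n];
}

/* Per-inode view: a pointer into the doubled table. The n entries that
 * follow it wrap around the original table without any index math. */
static const int *mock_inode_view(const int *doubled, size_t n,
                                  unsigned long ino)
{
	return doubled + (ino % n);	/* assumed choice of start offset */
}
```

For a table `{10, 20, 30}` and `ino = 4`, the view starts at index 1 and reads 20, 30, 10.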
    • exofs: Move exofs specific osd operations out of ios.c · 85e44df4
      Boaz Harrosh authored
      ios.c will be moving to an external library, for use by the
      objects-layout-driver. Remove from it some exofs specific functions.
      
      Also, g_attr_logical_length is used by both inode.c and ios.c;
      move its definition to the latter to keep it independent.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      85e44df4
    • exofs: Add offset/length to exofs_get_io_state · e1042ba0
      Boaz Harrosh authored
      In future raid code we will need to know the IO offset/length
      and if it's a read or write to determine some of the array
      sizes we'll need.
      
      So add a new exofs_get_rw_state() API for use when
      writing/reading. All other simple cases are left using the
      old way.
      
      The major change to this is that now we need to call
      exofs_get_io_state later at inode.c::read_exec and
      inode.c::write_exec when we actually know these things. So this
      patch is kept separate so I can test things apart from other
      changes.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      e1042ba0