1. 19 Nov, 2011 1 commit
    • hrtimer: Fix extra wakeups from __remove_hrtimer() · 27c9cd7e
      Jeff Ohlstein authored
      __remove_hrtimer() attempts to reprogram the clockevent device when
      the timer being removed is the next to expire. However,
      __remove_hrtimer() reprograms the clockevent *before* removing the
      timer from the timerqueue and thus when hrtimer_force_reprogram()
      finds the next timer to expire it finds the timer we're trying to
      remove.
      
This is especially noticeable when the system switches to NOHz mode
and the system tick is removed. The tick timer is removed from the
system, but the clockevent is still programmed to wake up one tick
(HZ) later anyway.
      
      Silence the extra wakeup by removing the timer from the timerqueue
      before calling hrtimer_force_reprogram() so that we actually program
      the clockevent for the next timer to expire.
      
      This was broken by 998adc3d "hrtimers: Convert hrtimers to use
      timerlist infrastructure".
      Signed-off-by: Jeff Ohlstein <johlstei@codeaurora.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1321660030-8520-1-git-send-email-johlstei@codeaurora.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 17 Nov, 2011 1 commit
  3. 11 Nov, 2011 1 commit
  4. 10 Nov, 2011 1 commit
    • clocksource: Avoid selecting mult values that might overflow when adjusted · d65670a7
      John Stultz authored
      For some frequencies, the clocks_calc_mult_shift() function will
      unfortunately select mult values very close to 0xffffffff.  This
      has the potential to overflow when NTP adjusts the clock, adding
      to the mult value.
      
This patch adds a clocksource.maxadj value, which provides
an approximation of an 11% adjustment (NTP limits adjustments to
500 ppm and the tick adjustment is limited to 10%) that could
be made to the clocksource.mult value. This is then used both to
check that the current mult value won't overflow/underflow, and
to warn us if the timekeeping_adjust() code pushes over that 11%
boundary.
      
      v2: Fix max_adjustment calculation, and improve WARN_ONCE
      messages.
      
      v3: Don't warn before maxadj has actually been set
      
      CC: Yong Zhang <yong.zhang0@gmail.com>
      CC: David Daney <ddaney.cavm@gmail.com>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Chen Jie <chenj@lemote.com>
      CC: zhangfx <zhangfx@lemote.com>
      CC: stable@kernel.org
      Reported-by: Chen Jie <chenj@lemote.com>
      Reported-by: zhangfx <zhangfx@lemote.com>
      Tested-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  5. 28 Oct, 2011 1 commit
  6. 12 Oct, 2011 1 commit
  7. 04 Oct, 2011 2 commits
  8. 21 Sep, 2011 1 commit
  9. 14 Sep, 2011 1 commit
  10. 13 Sep, 2011 1 commit
  11. 08 Sep, 2011 9 commits
    • posix-cpu-timers: Cure SMP accounting oddities · e8abccb7
      Peter Zijlstra authored
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Cc: stable@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • s390: Use direct ktime path for s390 clockevent device · 4f37a68c
      Martin Schwidefsky authored
      The clock comparator on s390 uses the same format as the TOD clock.
      If the value in the clock comparator is smaller than the current TOD
      value an interrupt is pending. Use the CLOCK_EVT_FEAT_KTIME feature
      to get the unmodified ktime of the next clockevent expiration and
      use it to program the clock comparator without querying the TOD clock.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133143.153017933@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • clockevents: Add direct ktime programming function · 65516f8a
      Martin Schwidefsky authored
      There is at least one architecture (s390) with a sane clockevent device
      that can be programmed with the equivalent of a ktime. No need to create
      a delta against the current time, the ktime can be used directly.
      
A new clock event device callback 'set_next_ktime' is introduced that
is called with the unmodified ktime for the timer if the clock event
device has the CLOCK_EVT_FEAT_KTIME bit set.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.815350967@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • clockevents: Make minimum delay adjustments configurable · d1748302
      Martin Schwidefsky authored
      The automatic increase of the min_delta_ns of a clockevents device
      should be done in the clockevents code as the minimum delay is an
      attribute of the clockevents device.
      
In addition, not all architectures want the automatic adjustment: on a
massively virtualized system the programming of a clock event can
fail several times in a row simply because the virtual CPU has been
rescheduled quickly enough. In that case the minimum delay would
erroneously be increased with no way back. The new config symbol
GENERIC_CLOCKEVENTS_MIN_ADJUST is used to enable the automatic
adjustment; the config option is selected only for x86.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.494157493@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • nohz: Remove "Switched to NOHz mode" debugging messages · 29c158e8
      Heiko Carstens authored
When performing CPU hotplug tests, the kernel printk log buffer gets flooded
with pointless "Switched to NOHz mode..." messages. Especially when a dump
is analyzed afterwards, this may have pushed more interesting information
out of the buffer.
Assuming that switching to NOHz mode simply works, just remove the printk.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Link: http://lkml.kernel.org/r/20110823112046.GB2540@osiris.boeblingen.de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • proc: Consider NO_HZ when printing idle and iowait times · a25cac51
      Michal Hocko authored
The show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
statistics when printing information about idle and iowait times.
This is fine if we are not using a tickless kernel (CONFIG_NO_HZ)
because the counters are updated periodically.
With NO_HZ things get trickier: we do no idle/iowait accounting while
we are tickless, so the values may become outdated.
Users of /proc/stat will notice this as unchanged idle/iowait values,
which is then interpreted as 0% idle/iowait time. From the user-space
POV this is unexpected behavior and a change of the interface.
      
      Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
      total idle/iowait time since boot and it doesn't rely on sampling or any
      other periodic activity. Fall back to the previous behavior if NO_HZ is
      disabled or not configured.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • nohz: Make idle/iowait counter update conditional · 09a1d34f
      Michal Hocko authored
      get_cpu_{idle,iowait}_time_us update idle/iowait counters
      unconditionally if the given CPU is in the idle loop.
      
This doesn't work well outside of the CPU governors, which are
singletons, so nobody (except for IRQs) can race with them.
      
      We will need to use both functions from /proc/stat handler to properly
      handle nohz idle/iowait times.
      
      Make the update depend on a non NULL last_update_time argument.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/11f23179472635ce52e78921d47a20216b872f23.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • nohz: Fix update_ts_time_stat idle accounting · 6beea0cd
      Michal Hocko authored
update_ts_time_stat currently updates the idle time even if we are in
the iowait loop at the moment. The only real users of the idle counter
(via get_cpu_idle_time_us) are CPU governors, and they expect to get
cumulative time for both idle and iowait.
The value (idle_sleeptime) is also printed to userspace by print_cpu,
but there it is printed alongside the iowait time, so the idle part is
misleading.

Let's clean this up and fix update_ts_time_stat to account both
counters properly, and update consumers of idle to consider iowait
time as well. With that done we can use get_cpu_{idle,iowait}_time_us
from other contexts as well and get the expected values.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/e9c909c221a8da402c4da07e4cd968c3218f8eb1.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • cputime: Clean up cputime_to_usecs and usecs_to_cputime macros · ef0e0f5e
      Michal Hocko authored
Get rid of the trailing semicolon so that those expressions can also
be used somewhere other than in an assignment.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/7565417ce30d7e6b1ddc169843af0777dbf66e75.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  12. 10 Aug, 2011 11 commits
  13. 08 Aug, 2011 1 commit
  14. 07 Aug, 2011 8 commits