1. 07 Dec, 2006 2 commits
    • [ARM] 3978/1: macro to provide a 63-bit value from a 32-bit hardware counter · 838ccbc3
      Nicolas Pitre authored
      This is done in a completely lockless fashion. Bits 0 to 31 of the count
      are provided by the hardware while bits 32 to 62 are stored in memory.
      The top bit of the word in memory is used to synchronize with the
      hardware count half-period.  When the top bits of the two counts
      (hardware and in memory) differ, the word in memory is updated with a
      new value, incremented when the hardware counter wraps around.  Because
      a word store to memory is atomic, the incremented value is always in
      sync with the top bit, which tells any concurrent reader whether the
      value in memory is up to date with respect to the needed increment.
      Any race in updating the value in memory is harmless, since the same
      value would simply be stored more than once.
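      The scheme described above can be sketched in plain user-space C.  This
      is a hedged model, not the kernel macro itself: the hardware counter
      read is left to the caller, the memory word name is illustrative, and
      it assumes readers call in at least once per hardware half-period.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Bits 32..62 of the count live in this word; its top bit tracks
       * the expected top bit of the 32-bit hardware count. */
      static volatile uint32_t cnt_hi;

      /* Lockless extension of a 32-bit hardware count to 63 bits.
       * 'lo' is the value just read from the hardware counter. */
      static uint64_t cnt32_to_63(uint32_t lo)
      {
          uint32_t hi = cnt_hi;

          /* Top bits differ: the hardware count has crossed a half-period
           * since cnt_hi was last written.  Toggle the sync bit and, when
           * the hardware has just wrapped (old sync bit was 1), carry one
           * into the stored high bits.  A concurrent caller racing on the
           * store writes the same value, so the race is harmless. */
          if ((int32_t)(hi ^ lo) < 0) {
              hi = (hi ^ 0x80000000u) + (hi >> 31);
              cnt_hi = hi;
          }

          /* Mask off the sync bit to yield the 63-bit value. */
          return (((uint64_t)hi << 32) | lo) & 0x7fffffffffffffffull;
      }
      ```

      With cnt_hi starting at zero, a read of 0x90000000 sets the sync bit,
      and a later read of 0x10000000 (after the hardware wraps) carries one
      into bits 32..62, yielding 0x110000000.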
      Signed-off-by: Nicolas Pitre <nico@cam.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
    • [ARM] 3611/4: optimize do_div() when divisor is constant · fa4adc61
      Nicolas Pitre authored
      On ARM all divisions have to be performed "manually".  A 64-bit
      division may take more than a hundred cycles in many cases.
      
      With 32-bit divisions, gcc already uses the reciprocal of a constant
      divisor to perform a multiplication instead, but it does not do so for
      64-bit divisions.
      
      Since the kernel is increasingly relying upon 64-bit divisions it is
      worth optimizing at least those cases where the divisor is a constant.
      This is what this patch does using plain C code that gets optimized away
      at compile time.
      
      For example, despite the amount of added C code, do_div(x, 10000) now
      produces the following assembly code (where x is assigned to r0-r1):
      
      	adr	r4, .L0
      	ldmia	r4, {r4-r5}
      	umull	r2, r3, r4, r0
      	mov	r2, #0
      	umlal	r3, r2, r5, r0
      	umlal	r3, r2, r4, r1
      	mov	r3, #0
      	umlal	r2, r3, r5, r1
      	mov	r0, r2, lsr #11
      	orr	r0, r0, r3, lsl #21
      	mov	r1, r3, lsr #11
      	...
      .L0:
      	.word	948328779
      	.word	879609302
      
      which is the fastest that can be done for any value of x in that case,
      many times faster than the __do_div64 code (except for the small range
      of x values for which the result ends up being zero or a single bit).
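      The idea behind the generated code can be illustrated in plain C for
      the do_div(x, 10000) case: multiply by a precomputed scaled reciprocal
      and keep only the top bits of the 128-bit product.  This is a hedged
      user-space sketch using GCC's unsigned __int128 extension, whereas the
      patch itself builds the wide multiply out of umull/umlal steps as
      shown above; the constant combines the two words the assembly loads
      (0x346DC5D6 = 879609302 and 0x3886594B = 948328779).

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Divide a 64-bit value by the constant 10000 via reciprocal
       * multiplication: x / 10000 == (x * ceil(2^75 / 10000)) >> 75. */
      static uint64_t div_by_10000(uint64_t x)
      {
          const uint64_t magic = 0x346DC5D63886594Bull; /* ceil(2^75/10000) */

          /* Full 128-bit product (the ARM code composes this from
           * 32x32->64 umull/umlal multiplies). */
          unsigned __int128 prod = (unsigned __int128)x * magic;

          /* Keep the high 64 bits, then shift 11 more: 64 + 11 = 75. */
          return (uint64_t)(prod >> 64) >> 11;
      }
      ```

      The rounding error of the scaled reciprocal (10000 * magic - 2^75 =
      432) is small enough that the result is exact for every 64-bit x.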
      
      The fact that this code is generated inline produces a tiny increase
      in .text size, but it is not significant compared to the code needed
      around each __do_div64 call site that this inline code replaces.
      
      The algorithm used has been validated on a 16-bit scale for all
      possible values, and then recoded for 64-bit values.  Furthermore,
      I've been running it with the final BUG_ON() uncommented for over two
      months now with no problem.
      
      Note that this new code is used only with gcc versions 4.0 or later.
      Earlier gcc versions proved too problematic, so only the original
      code is used with them.
      Signed-off-by: Nicolas Pitre <nico@cam.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
  2. 29 Nov, 2006 22 commits
  3. 28 Nov, 2006 16 commits