• Thomas Gleixner's avatar
    x86: hpet: Work around hardware stupidity · 54ff7e59
    Thomas Gleixner authored
    This more or less reverts commits 08be9796 (x86: Force HPET
    readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict
    read back to affected ATI chipsets) to the status of commit 8da854cb
    (x86, hpet: Erratum workaround for read after write of HPET
    comparator).
    
    The delta to commit 8da854cb is mostly comments and the change from
    WARN_ONCE to printk_once as we know the call path of this function
    already.
    
    This needs really in depth explanation:
    
    First of all the HPET design is a complete failure. Having a counter
    compare register which generates an interrupt on matching values
    forces the software to do at least one superfluous readback of the
    counter register.
    
    While it is nice in theory to program "absolute" time events it is
    practically useless because the timer runs at some absurd frequency
    which can never be matched to real world units. So we are forced to
    calculate a relative delta and this forces a readout of the actual
    counter value, adding the delta and programming the compare
    register. When the delta is small enough we run into the danger that
    we program a compare value which is already in the past. Due to the
    compare for equal nature of HPET we need to read back the counter
    value after writing the compare rehgister (btw. this is necessary for
    absolute timeouts as well) to make sure that we did not miss the timer
    event. We try to work around that by setting the minimum delta to a
    value which is larger than the theoretical time which elapses between
    the counter readout and the compare register write, but that's only
    true in theory. A NMI or SMI which hits between the readout and the
    write can easily push us beyond that limit. This would result in
    waiting for the next HPET timer interrupt until the 32bit wraparound
    of the counter happens which takes about 306 seconds.
    
    So we designed the next event function to look like:
    
       match = read_cnt() + delta;
       write_compare_ref(match);
       return read_cnt() < match ? 0 : -ETIME;
    
    At some point we got into trouble with certain ATI chipsets. Even the
    above "safe" procedure failed. The reason was that the write to the
    compare register was delayed probably for performance reasons. The
    theory was that they wanted to avoid the synchronization of the write
    with the HPET clock, which is understandable. So the write does not
    hit the compare register directly instead it goes to some intermediate
    register which is copied to the real compare register in sync with the
    HPET clock. That opens another window for hitting the dreaded "wait
    for a wraparound" problem.
    
    To work around that "optimization" we added a read back of the compare
    register which either enforced the update of the just written value or
    just delayed the readout of the counter enough to avoid the issue. We
    unfortunately never got any affirmative info from ATI/AMD about this.
    
    One thing is sure, that we nuked the performance "optimization" that
    way completely and I'm pretty sure that the result is worse than
    before some HW folks came up with those.
    
    Just for paranoia reasons I added a check whether the read back
    compare register value was the same as the value we wrote right
    before. That paranoia check triggered a couple of years after it was
    added on an Intel ICH9 chipset. Venki added a workaround (commit
    8da854cb) which was reading the compare register twice when the first
    check failed. We considered this to be a penalty in general and
    restricted the readback (thus the wasted CPU cycles) to the known to
    be affected ATI chipsets.
    
    This turned out to be a utterly wrong decision. 2.6.35 testers
    experienced massive problems and finally one of them bisected it down
    to commit 30a564be which spured some further investigation.
    
    Finally we got confirmation that the write to the compare register can
    be delayed by up to two HPET clock cycles which explains the problems
    nicely. All we can do about this is to go back to Venki's initial
    workaround in a slightly modified version.
    
    Just for the record I need to say, that all of this could have been
    avoided if hardware designers and of course the HPET committee would
    have thought about the consequences for a split second. It's out of my
    comprehension why designing a working timer is so hard. There are two
    ways to achieve it:
    
     1) Use a counter wrap around aware compare_reg <= counter_reg
        implementation instead of the easy compare_reg == counter_reg
    
        Downsides:
    
    	- It needs more silicon.
    
    	- It needs a readout of the counter to apply a relative
    	  timeout. This is necessary as the counter does not run in
    	  any useful (and adjustable) frequency and there is no
    	  guarantee that the counter which is used for timer events is
    	  the same which is used for reading the actual time (and
    	  therefor for calculating the delta)
    
        Upsides:
    
    	- None
    
      2) Use a simple down counter for relative timer events
    
        Downsides:
    
    	- Absolute timeouts are not possible, which is not a problem
    	  at all in the context of an OS and the expected
    	  max. latencies/jitter (also see Downsides of #1)
    
       Upsides:
    
    	- It needs less or equal silicon.
    
    	- It works ALWAYS
    
    	- It is way faster than a compare register based solution (One
    	  write versus one write plus at least one and up to four
    	  reads)
    
    I would not be so grumpy about all of this, if I would not have been
    ignored for many years when pointing out these flaws to various
    hardware folks. I really hate timers (at least those which seem to be
    designed by janitors).
    
    Though finally we got a reasonable explanation plus a solution and I
    want to thank all the folks involved in chasing it down and providing
    valuable input to this.
    Bisected-by: default avatarNix <nix@esperi.org.uk>
    Reported-by: default avatarArtur Skawina <art.08.09@gmail.com>
    Reported-by: default avatarDamien Wyart <damien.wyart@free.fr>
    Reported-by: default avatarJohn Drescher <drescherjm@gmail.com>
    Cc: Venkatesh Pallipadi <venki@google.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
    Cc: Borislav Petkov <borislav.petkov@amd.com>
    Cc: stable@kernel.org
    Acked-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
    Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
    54ff7e59
hpet.h 3.27 KB