timers/migration: Move hierarchy setup into cpuhotplug prepare callback
When a CPU comes online the first time, it is possible that a new top level group will be created. In general all propagation is done from the bottom to top. This minimizes complexity and prevents possible races. But when a new top level group is created, the formely top level group needs to be connected to the new level. This is the only time, when the direction to propagate changes is changed: the changes are propagated from top (new top level group) to bottom (formerly top level group). This introduces two races (see (A) and (B)) as reported by Frederic: (A) This race happens, when marking the formely top level group as active, but the last active CPU of the formerly top level group goes idle. Then it's likely that formerly group is no longer active, but marked nevertheless as active in new top level group: [GRP0:0] migrator = 0 active = 0 nextevt = KTIME_MAX / \ 0 1 .. 7 active idle 0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU. [GRP1:0] migrator = TMIGR_NONE active = NONE nextevt = KTIME_MAX \ [GRP0:0] [GRP0:1] migrator = 0 migrator = TMIGR_NONE active = 0 active = NONE nextevt = KTIME_MAX nextevt = KTIME_MAX / \ 0 1 .. 7 8 active idle !online 1) CPU 8 is booting and creates a new group in first level GRP0:1 and therefore also a new top group GRP1:0. For now the setup code proceeded only until the connected between GRP0:1 to the new top group. The connection between CPU8 and GRP0:1 is not yet established and CPU 8 is still !online. [GRP1:0] migrator = TMIGR_NONE active = NONE nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = 0 migrator = TMIGR_NONE active = 0 active = NONE nextevt = KTIME_MAX nextevt = KTIME_MAX / \ 0 1 .. 7 8 active idle !online 2) Setup code now connects GRP0:0 to GRP1:0 and observes while in tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it prepares to call tmigr_active_up() on it. It hasn't done it yet. [GRP1:0] migrator = TMIGR_NONE active = NONE nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = TMIGR_NONE migrator = TMIGR_NONE active = NONE active = NONE nextevt = KTIME_MAX nextevt = KTIME_MAX / \ 0 1 .. 7 8 idle idle !online 3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with GRP0:0->lock held, CPU 0 observes GRP1:0 after calling tmigr_update_events() and it propagates the change to the top (no change there and no wakeup programmed since there is no timer). [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = TMIGR_NONE migrator = TMIGR_NONE active = NONE active = NONE nextevt = KTIME_MAX nextevt = KTIME_MAX / \ 0 1 .. 7 8 idle idle !online 4) Now the setup code finally calls tmigr_active_up() to and sets GRP0:0 active in GRP1:0 [GRP1:0] migrator = GRP0:0 active = GRP0:0, GRP0:1 nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = TMIGR_NONE migrator = 8 active = NONE active = 8 nextevt = KTIME_MAX nextevt = KTIME_MAX / \ | 0 1 .. 7 8 idle idle active 5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out of tmigr_cpu_online(). [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T8 / \ [GRP0:0] [GRP0:1] migrator = TMIGR_NONE migrator = TMIGR_NONE active = NONE active = NONE nextevt = KTIME_MAX nextevt = T8 / \ | 0 1 .. 7 8 idle idle idle 5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator. But it's not really active, so T8 gets ignored. --> The update which is done in third step is not noticed by setup code. So a wrong migrator is set to top level group and a timer could get ignored. (B) Reading group->parent and group->childmask when an hierarchy update is ongoing and reaches the formerly top level group is racy as those values could be inconsistent. (The notation of migrator and active now slightly changes in contrast to the above example, as now the childmasks are used.) [GRP1:0] migrator = TMIGR_NONE active = 0x00 nextevt = KTIME_MAX \ [GRP0:0] [GRP0:1] migrator = TMIGR_NONE migrator = TMIGR_NONE active = 0x00 active = 0x00 nextevt = KTIME_MAX nextevt = KTIME_MAX childmask= 0 childmask= 1 parent = NULL parent = GRP1:0 / \ 0 1 .. 7 8 idle idle !online childmask=1 1) Hierarchy has 8 CPUs. CPU 8 is at the moment in the process of onlining but did not yet connect GRP0:0 to GRP1:0. [GRP1:0] migrator = TMIGR_NONE active = 0x00 nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = TMIGR_NONE migrator = TMIGR_NONE active = 0x00 active = 0x00 nextevt = KTIME_MAX nextevt = KTIME_MAX childmask= 0 childmask= 1 parent = GRP1:0 parent = GRP1:0 / \ 0 1 .. 7 8 idle idle !online childmask=1 2) Setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates parent pointer of GRP0:0 and ... [GRP1:0] migrator = TMIGR_NONE active = 0x00 nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = 0x01 migrator = TMIGR_NONE active = 0x01 active = 0x00 nextevt = KTIME_MAX nextevt = KTIME_MAX childmask= 0 childmask= 1 parent = GRP1:0 parent = GRP1:0 / \ 0 1 .. 7 8 active idle !online childmask=1 tmigr_walk.childmask = 0 3) ... CPU 0 comes active in the same time. As migrator in GRP0:0 was TMIGR_NONE, childmask of GRP0:0 is stored in update propagation data structure tmigr_walk (as update of childmask is not yet visible/updated). And now ... [GRP1:0] migrator = TMIGR_NONE active = 0x00 nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = 0x01 migrator = TMIGR_NONE active = 0x01 active = 0x00 nextevt = KTIME_MAX nextevt = KTIME_MAX childmask= 2 childmask= 1 parent = GRP1:0 parent = GRP1:0 / \ 0 1 .. 7 8 active idle !online childmask=1 tmigr_walk.childmask = 0 4) ... childmask of GRP0:0 is updated by CPU 8 (still part of setup code). [GRP1:0] migrator = 0x00 active = 0x00 nextevt = KTIME_MAX / \ [GRP0:0] [GRP0:1] migrator = 0x01 migrator = TMIGR_NONE active = 0x01 active = 0x00 nextevt = KTIME_MAX nextevt = KTIME_MAX childmask= 2 childmask= 1 parent = GRP1:0 parent = GRP1:0 / \ 0 1 .. 7 8 active idle !online childmask=1 tmigr_walk.childmask = 0 5) CPU 0 sees the connection to GRP1:0 and now propagates active state to GRP1:0 but with childmask = 0 as stored in propagation data structure. --> Now GRP1:0 always has a migrator as 0x00 != TMIGR_NONE and for all CPUs it looks like GRP1:0 is always active. To prevent those races, the setup of the hierarchy is moved into the cpuhotplug prepare callback. The prepare callback is not executed by the CPU which will come online, it is executed by the CPU which prepares onlining of the other CPU. This CPU is active while it is connecting the formerly top level to the new one. This prevents from (A) to happen and it also prevents from any further walk above the formerly top level until that active CPU becomes inactive, releasing the new ->parent and ->childmask updates to be visible by any subsequent walk up above the formerly top level hierarchy. This prevents from (B) to happen. The direction for the updates is now forced to look like "from bottom to top". However if the active CPU prevents from tmigr_cpu_(in)active() to walk up with the update not-or-half visible, nothing prevents walking up to the new top with a 0 childmask in tmigr_handle_remote_up() or tmigr_requires_handle_remote_up() if the active CPU doing the prepare is not the migrator. But then it looks fine because: * tmigr_check_migrator() should just return false * The migrator is active and should eventually observe the new childmask at some point in a future tick. Split setup functionality of online callback into the cpuhotplug prepare callback and setup hotplug state. Change init call into early_initcall() to make sure an already active CPU prepares everything for newly upcoming CPUs. Reorder the code, that all prepare related functions are close to each other and online and offline callbacks are also close together. Fixes: 7ee98877 ("timers: Implement the hierarchical pull model") Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240717094940.18687-1-anna-maria@linutronix.de
Showing
Please register or sign in to comment