• Daniel Vetter's avatar
    drm/i915: Fix use-after-free in do_switch · acc240d4
    Daniel Vetter authored
    So apparently under ridiculous amounts of memory pressure we can get
    into trouble in do_switch when we try to move the old hw context
    backing storage object onto the active lists.
    
    With list debugging enabled that usually results in us chasing a
    poisoned pointer - which means we've hit upon a vma that has been
    removed from all lrus with list_del (and then deallocated, so it's a
    real use-after free).
    
    Ian Lister has done some great callchain chasing and noticed that we
    can reenter do_switch:
    
    i915_gem_do_execbuffer()
    
    i915_switch_context()
    
    do_switch()
       from = ring->last_context;
       i915_gem_object_pin()
    
          i915_gem_object_bind_to_gtt()
             ret = drm_mm_insert_node_in_range_generic();
             // If the above call fails then it will try i915_gem_evict_something()
             // If that fails it will call i915_gem_evict_everything() ...
    	 i915_gem_evict_everything()
    	    i915_gpu_idle()
    	       i915_switch_context(DEFAULT_CONTEXT)
    
    Like with everything else where the shrinker or eviction code can
    invalidate pointers we need to reload relevant state.
    
    Note that there's no need to recheck whether a context switch is still
    required because:
    
    - Doing a switch to the same context is harmless (besides wasting a
      bit of energy).
    
    - This can only happen with the default context. But since that one's
      pinned we'll never call down into evict_everything under normal
      circumstances. Note that there's a little driver bringup fun
      involved namely that we could recourse into do_switch for the
      initial switch. Atm we're fine since we assign the context pointer
      only after the call to do_switch at driver load or resume time. And
      in the gpu reset case we skip the entire setup sequence (which might
      be a bug on its own, but definitely not this one here).
    
    Cc'ing stable since apparently ChromeOS guys are seeing this in the
    wild (and not just on artificial stress tests), see the reference.
    
    Note that in upstream code doesn't calle evict_everything directly
    from evict_something, that's an extension in this product branch. But
    we can still hit upon this bug (and apparently we do, see the linked
    backtraces). I've noticed this while trying to construct a testcase
    for this bug and utterly failed to provoke it. It looks like we need
    to driver the system squarly into the lowmem wall and provoke the
    shrinker to evict the context object by doing the last-ditch
    evict_everything call.
    
    Aside: There's currently no means to get a badly-fragmenting hw
    context object away from a bad spot in the upstream code. We should
    fix this by at least adding some code to evict_something to handle hw
    contexts.
    
    References: https://code.google.com/p/chromium/issues/detail?id=248191Reported-by: default avatarIan Lister <ian.lister@intel.com>
    Cc: Ian Lister <ian.lister@intel.com>
    Cc: stable@vger.kernel.org
    Cc: Ben Widawsky <benjamin.widawsky@intel.com>
    Cc: Stéphane Marchesin <marcheu@chromium.org>
    Cc: Bloomfield, Jon <jon.bloomfield@intel.com>
    Tested-by: default avatarRafael Barbalho <rafael.barbalho@intel.com>
    Reviewed-by: default avatarIan Lister <ian.lister@intel.com>
    Signed-off-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
    acc240d4
i915_gem_context.c 18.3 KB