    libceph: fix double __remove_osd() problem · 55452286
    Ilya Dryomov authored
    commit 7eb71e03 upstream.
    
    It turns out it's possible to get __remove_osd() called twice on the
    same OSD.  That doesn't sit well with rb_erase() - depending on the
    shape of the tree we can get a NULL dereference, a soft lockup or
    a random crash at some point in the future as we end up touching freed
    memory.  One scenario that I was able to reproduce is as follows:
    
                <osd3 is idle, on the osd lru list>
    <con reset - osd3>
    con_fault_finish()
      osd_reset()
                                  <osdmap - osd3 down>
                                  ceph_osdc_handle_map()
                                    <takes map_sem>
                                    kick_requests()
                                      <takes request_mutex>
                                      reset_changed_osds()
                                        __reset_osd()
                                          __remove_osd()
                                      <releases request_mutex>
                                    <releases map_sem>
        <takes map_sem>
        <takes request_mutex>
        __kick_osd_requests()
          __reset_osd()
            __remove_osd() <-- !!!
    
    A case can be made that osd refcounting is imperfect and reworking it
    would be a proper resolution, but for now Sage and I decided to fix
    this by adding a safeguard around __remove_osd().
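
    The idea behind such a safeguard can be sketched in plain userspace C.
    This is only an illustration, not the patch itself: the struct and
    function names below are hypothetical, and a circular doubly linked
    list stands in for the kernel rbtree.  The point it shows is that
    a node marked as unlinked on removal turns a second removal into
    a no-op instead of a walk over stale pointers.  In the kernel, the
    analogous marker and check would presumably be RB_CLEAR_NODE() and
    RB_EMPTY_NODE() from <linux/rbtree.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an OSD session; not the kernel's struct ceph_osd. */
struct osd {
    int id;
    struct osd *prev, *next;    /* both point at the node itself when unlinked */
};

/* Hypothetical container; a circular list replaces the rbtree of the real code. */
struct osd_set {
    struct osd head;            /* sentinel, never removed */
    int count;
};

static void osd_set_init(struct osd_set *s)
{
    s->head.id = -1;
    s->head.prev = s->head.next = &s->head;
    s->count = 0;
}

static void osd_init(struct osd *o, int id)
{
    o->id = id;
    o->prev = o->next = o;      /* "not in any set" marker */
}

static bool osd_linked(const struct osd *o)
{
    return o->next != o;
}

static void osd_insert(struct osd_set *s, struct osd *o)
{
    assert(!osd_linked(o));
    o->next = s->head.next;
    o->prev = &s->head;
    s->head.next->prev = o;
    s->head.next = o;
    s->count++;
}

/*
 * Guarded removal: returns true if the node was unlinked by this call,
 * false if it had already been removed.  Without the osd_linked() check
 * and the re-marking below, a second call would rewrite neighbours
 * through stale prev/next pointers.
 */
static bool osd_remove(struct osd_set *s, struct osd *o)
{
    if (!osd_linked(o))
        return false;

    o->prev->next = o->next;
    o->next->prev = o->prev;
    o->prev = o->next = o;      /* mark as removed so the guard can see it */
    s->count--;
    return true;
}
```

    With this shape, the second caller in the scenario above would find
    the node already unlinked and return without touching the set.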
    
    Fixes: http://tracker.ceph.com/issues/8087
    
    Cc: Sage Weil <sage@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Sage Weil <sage@redhat.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
    Signed-off-by: Kamal Mostafa <kamal@canonical.com>
osd_client.c 70.2 KB