1. 31 Jul, 2018 6 commits
    • Jack Morgenstein's avatar
      IB/mlx4: Use 4K pages for kernel QP's WQE buffer · f95ccffc
      Jack Morgenstein authored
      In the current implementation, the driver tries to allocate contiguous
      memory, and if it fails, it falls back to 4K fragmented allocation.
      
      Once the memory is fragmented, the first allocation might take a lot
      of time, and even fail, which can cause connection failures.
      
      This patch changes the logic to always allocate with 4K granularity,
      since it's more robust and more likely to succeed.
      
      This patch was tested with Lustre and no performance degradation
      was observed.
      
      Note: This commit eliminates the "shrinking WQE" feature. This feature
      depended on using vmap to create a virtually contiguous send WQ.
      vmap use was abandoned due to problems with several processors (see the
      commit cited below). As a result, shrinking WQE was available only with
      physically contiguous send WQs. Allocating such send WQs caused the
      problems described above.
      Therefore, as a side effect of eliminating the use of large physically
      contiguous send WQs, the shrinking WQE feature became unavailable.
      
      Warning example:
      worker/20:1: page allocation failure: order:8, mode:0x80d0
      CPU: 20 PID: 513 Comm: kworker/20:1 Tainted: G OE ------------
      Workqueue: ib_cm cm_work_handler [ib_cm]
      Call Trace:
      [<ffffffff81686d81>] dump_stack+0x19/0x1b
      [<ffffffff81186160>] warn_alloc_failed+0x110/0x180
      [<ffffffff8118a954>] __alloc_pages_nodemask+0x9b4/0xba0
      [<ffffffff811ce868>] alloc_pages_current+0x98/0x110
      [<ffffffff81184fae>] __get_free_pages+0xe/0x50
      [<ffffffff8133f6fe>] swiotlb_alloc_coherent+0x5e/0x150
      [<ffffffff81062551>] x86_swiotlb_alloc_coherent+0x41/0x50
      [<ffffffffa056b4c4>] mlx4_buf_direct_alloc.isra.7+0xc4/0x180 [mlx4_core]
      [<ffffffffa056b73b>] mlx4_buf_alloc+0x1bb/0x260 [mlx4_core]
      [<ffffffffa0b15496>] create_qp_common+0x536/0x1000 [mlx4_ib]
      [<ffffffff811c6ef7>] ? dma_pool_free+0xa7/0xd0
      [<ffffffffa0b163c1>] mlx4_ib_create_qp+0x3b1/0xdc0 [mlx4_ib]
      [<ffffffffa0b01bc2>] ? mlx4_ib_create_cq+0x2d2/0x430 [mlx4_ib]
      [<ffffffffa0b21f20>] mlx4_ib_create_qp_wrp+0x10/0x20 [mlx4_ib]
      [<ffffffffa08f152a>] ib_create_qp+0x7a/0x2f0 [ib_core]
      [<ffffffffa06205d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
      [<ffffffffa08275c9>] kiblnd_create_conn+0xbf9/0x1950 [ko2iblnd]
      [<ffffffffa074077a>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
      [<ffffffffa0835519>] kiblnd_passive_connect+0xa99/0x18c0 [ko2iblnd]
      
      Fixes: 73898db0 ("net/mlx4: Avoid wrong virtual mappings")
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      f95ccffc
    • Jason Gunthorpe's avatar
      IB/uverbs: Add UVERBS_ATTR_FLAGS_IN to the specs language · bccd0622
      Jason Gunthorpe authored
      This clearly indicates that the input is a bitwise combination of values
      in an enum, and identifies which enum contains the definition of the bits.
      
      Special accessors are provided that handle the mandatory validation of the
      allowed bits and enforce the correct type for bitwise flags.
      
      If we had introduced this at the start then the kabi would have uniformly
      used u64 data to pass flags, however today there is a mixture of u64 and
      u32 flags. All places are converted to accept both sizes and the accessor
      fixes it. This allows all existing flags to grow to u64 in future without
      any hassle.
      
      Finally all flags are, by definition, optional. If flags are not passed
      the accessor does not fail, but provides a value of zero.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      bccd0622
    • Bart Van Assche's avatar
      RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const · d34ac5cd
      Bart Van Assche authored
      Since neither ib_post_send() nor ib_post_recv() modify the data structure
      their second argument points at, declare that argument const. This change
      makes it necessary to declare the 'bad_wr' argument const too and also to
      modify all ULPs that call ib_post_send(), ib_post_recv() or
      ib_post_srq_recv(). This patch does not change any functionality but makes
      it possible for the compiler to verify whether the
      ib_post_(send|recv|srq_recv) really do not modify the posted work request.
      
      To make this possible, only one cast had to be introduce that casts away
      constness, namely in rpcrdma_post_recvs(). The only way I can think of to
      avoid that cast is to introduce an additional loop in that function or to
      change the data type of bad_wr from struct ib_recv_wr ** into int
      (an index that refers to an element in the work request list). However,
      both approaches would require even more extensive changes than this
      patch.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      d34ac5cd
    • Bart Van Assche's avatar
      IB/mlx5, ib_post_send(), IB_WR_REG_SIG_MR: Do not modify the 'wr' argument · 7bb1fafc
      Bart Van Assche authored
      Since the next patch will constify the wr pointer, do not modify the data
      that pointer points at.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      7bb1fafc
    • Bart Van Assche's avatar
      RDMA: Constify the argument of the work request conversion functions · f696bf6d
      Bart Van Assche authored
      When posting a send work request, the work request that is posted is not
      modified by any of the RDMA drivers. Make this explicit by constifying
      most ib_send_wr pointers in RDMA transport drivers.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      f696bf6d
    • Bart Van Assche's avatar
      IB/iser: Inline two work request conversion functions · 3e081b77
      Bart Van Assche authored
      Since the next patch will change the return type of these functions into a
      const pointer and since the iSER driver modifies the work request these
      functions return a pointer two, inline two work request conversion
      function calls. This patch does not change any functionality.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      3e081b77
  2. 27 Jul, 2018 3 commits
  3. 26 Jul, 2018 14 commits
  4. 25 Jul, 2018 13 commits
    • Varsha Rao's avatar
      IB/core: Remove extra parentheses · 076dd53b
      Varsha Rao authored
      Remove unnecessary parentheses to fix the clang warning of extraneous
      parentheses.
      Signed-off-by: default avatarVarsha Rao <rvarsha016@gmail.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      076dd53b
    • Bart Van Assche's avatar
      RDMA/ocrdma: Suppress a compiler warning · 9491a1ed
      Bart Van Assche authored
      This patch avoids that the following compiler warning is reported when
      building with gcc 8 and W=1:
      
      In function 'ocrdma_mbx_get_ctrl_attribs',
          inlined from 'ocrdma_init_hw' at drivers/infiniband/hw/ocrdma/ocrdma_hw.c:3224:11:
      drivers/infiniband/hw/ocrdma/ocrdma_hw.c:1368:3: warning: 'strncpy' output may be truncated copying 31 bytes from a string of length 31 [-Wstringop-truncation]
         strncpy(dev->model_number,
         ^~~~~~~~~~~~~~~~~~~~~~~~~~
          hba_attribs->controller_model_number, 31);
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Acked-by: default avatarSelvin Xavier <selvin.xavier@broadcom.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      9491a1ed
    • Jason Gunthorpe's avatar
      IB/uverbs: Fix locking around struct ib_uverbs_file ucontext · 22fa27fb
      Jason Gunthorpe authored
      We have a parallel unlocked reader and writer with ib_uverbs_get_context()
      vs everything else, and nothing guarantees this works properly.
      
      Audit and fix all of the places that access ucontext to use one of the
      following locking schemes:
      - Call ib_uverbs_get_ucontext() under SRCU and check for failure
      - Access the ucontext through an struct ib_uobject context member
        while holding a READ or WRITE lock on the uobject.
        This value cannot be NULL and has no race.
      - Hold the ucontext_lock and check for ufile->ucontext !NULL
      
      This also re-implements ib_uverbs_get_ucontext() in a way that is safe
      against concurrent ib_uverbs_get_context() and disassociation.
      
      As a side effect, every access to ucontext in the commands is via
      ib_uverbs_get_context() with an error check, or via the uobject, so there
      is no longer any need for the core code to check ucontext on every command
      call. These checks are also removed.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      22fa27fb
    • Jason Gunthorpe's avatar
      IB/mlx5: Use the ucontext from the uobj, not the file · c36ee46d
      Jason Gunthorpe authored
      This approach matches the standard flow of the typical write method that
      relies on the HW object to store the device and the uobject to access the
      ucontext.  Avoids the use of the devx_ufile2uctx in several places will
      make revising the semantics of ib_uverbs_get_ucontext() in the next patch
      simpler.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      c36ee46d
    • Jason Gunthorpe's avatar
      IB/uverbs: Move the FD uobj type struct file allocation to alloc_commit · aba94548
      Jason Gunthorpe authored
      Allocating the struct file during alloc_begin creates this strange
      asymmetry with IDR, where the FD has two krefs pointing at it during the
      pre-commit phase. In particular this makes the abort process for FD very
      strange and confusing.
      
      For instance abort currently calls the type's destroy_object twice, and
      the fops release once if abort is done. This is very counter intuitive. No
      fops should be called until alloc_commit succeeds, and destroy_object
      should only ever be called once.
      
      Moving the struct file allocation to the alloc_commit is now simple, as we
      already support failure of rdma_alloc_commit_uobject, with all the
      required rollback pieces.
      
      This creates an understandable symmetry with IDR and simplifies/fixes the
      abort handling for FD types.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      aba94548
    • Jason Gunthorpe's avatar
      IB/uverbs: Always propagate errors from rdma_alloc_commit_uobject() · 2c96eb7d
      Jason Gunthorpe authored
      The ioctl framework already does this correctly, but the write path did
      not. This is trivially fixed by simply using a standard pattern to return
      uobj_alloc_commit() as the last statement in every function.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      2c96eb7d
    • Jason Gunthorpe's avatar
      IB/uverbs: Rework the locking for cleaning up the ucontext · e951747a
      Jason Gunthorpe authored
      The locking here has always been a bit crazy and spread out, upon some
      careful analysis we can simplify things.
      
      Create a single function uverbs_destroy_ufile_hw() that internally handles
      all locking. This pulls together pieces of this process that were
      sprinkled all over the places into one place, and covers them with one
      lock.
      
      This eliminates several duplicate/confusing locks and makes the control
      flow in ib_uverbs_close() and ib_uverbs_free_hw_resources() extremely
      simple.
      
      Unfortunately we have to keep an extra mutex, ucontext_lock.  This lock is
      logically part of the rwsem and provides the 'down write, fail if write
      locked, wait if read locked' semantic we require.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      e951747a
    • Jason Gunthorpe's avatar
      IB/uverbs: Revise and clarify the rwsem and uobjects_lock · 87064277
      Jason Gunthorpe authored
      Rename 'cleanup_rwsem' to 'hw_destroy_rwsem' which is held across any call
      to the type destroy function (aka 'hw' destroy). The main purpose of this
      lock is to prevent normal add and destroy from running concurrently with
      uverbs_cleanup_ufile()
      
      Since the uobjects list is always manipulated under the 'hw_destroy_rwsem'
      we can eliminate the uobjects_lock in the cleanup function. This allows
      converting that lock to a very simple spinlock with a narrow critical
      section.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      87064277
    • Jason Gunthorpe's avatar
      IB/uverbs: Clarify and revise uverbs_close_fd · e6d5d5dd
      Jason Gunthorpe authored
      The locking requirements here have changed slightly now that we can rely
      on the ib_uverbs_file always existing and containing all the necessary
      locking infrastructure.
      
      That means we can get rid of the cleanup_mutex usage (this was protecting
      the check on !uboj->context).
      
      Otherwise, follow the same pattern that IDR uses for destroy, acquire
      exclusive write access, then call destroy and the undo the 'lookup'.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      e6d5d5dd
    • Jason Gunthorpe's avatar
      IB/uverbs: Revise the placement of get/puts on uobject · 5671f79b
      Jason Gunthorpe authored
      This wasn't wrong, but the placement of two krefs didn't make any
      sense. Follow some simple rules.
      
      - A kref is held inside uobjects_list
      - A kref is held inside the IDR
      - A kref is held inside file->private
      - A stack based kref is passed bettwen alloc_begin and
        alloc_abort/alloc_commit
      
      Any place we destroy one of the above pointers, we stick a put,
      or 'move' the kref into another pointer.
      
      The key functions have sensible semantics:
      - alloc_uobj fully initializes the common members in uobj, including
        the list
      - Get rid of the uverbs_idr_remove_uobj helper since IDR remove
        does require put, but it depends on the situation. Later
        patches will re-consolidate this differently.
      - alloc_abort always consumes the passed kref, done in the type
      - alloc_commit always consumes the passed kref, done in the type
      - rdma_remove_commit_uobject always pairs with a lookup_get
      
      After it is all done the only control flow change is to:
      - move a get from alloc_commit_fd_uobject to rdma_alloc_commit_uobject
      - add a put to remove_commit_idr_uobject
      - Consistenly use rdma_lookup_put in rdma_remove_commit_uobject at
        the right place
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      5671f79b
    • Jason Gunthorpe's avatar
      IB/uverbs: Clarify the kref'ing ordering for alloc_commit · c561c288
      Jason Gunthorpe authored
      The alloc_commit callback makes the uobj visible to other threads,
      and it does so using a 'move' semantic of the uobj kref on the stack
      into the public storage (eg the IDR, uobject list and file_private_data)
      
      Once this is done another thread could start up and trigger deletion
      of the kref. Fortunately cleanup_rwsem happens to prevent this from
      being a bug, but that is a fantastically unclear side effect.
      
      Re-organize things so that alloc_commit is that last thing to touch
      the uobj, get rid of the sneaky implicit dependency on cleanup_rwsem,
      and add a comment reminding that uobj is no longer kref'd after
      alloc_commit.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      c561c288
    • Jason Gunthorpe's avatar
      IB/uverbs: Handle IDR and FD types without truncation · 1250c304
      Jason Gunthorpe authored
      Our ABI for write() uses a s32 for FDs and a u32 for IDRs, but internally
      we ended up implicitly casting these ABI values into an 'int'. For ioctl()
      we use a s64 for FDs and a u64 for IDRs, again casting to an int.
      
      The various casts to int are all missing range checks which can cause
      userspace values that should be considered invalid to be accepted.
      
      Fix this by making the generic lookup routine accept a s64, which does not
      truncate the write API's u32/s32 or the ioctl API's s64. Then push the
      detailed range checking down to the actual type implementations to be
      shared by both interfaces.
      
      Finally, change the copy of the uobj->id to sign extend into a s64, so eg,
      if we ever wish to return a negative value for a FD it is carried
      properly.
      
      This ensures that userspace values are never weirdly interpreted due to
      the various trunctations and everything that is really out of range gets
      an EINVAL.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      1250c304
    • Jason Gunthorpe's avatar
      IB/uverbs: Get rid of null_obj_type · 3df593bf
      Jason Gunthorpe authored
      If the method fails after calling rdma_explicit_destroy (eg if
      copy_to_user faults) then it will trigger a kernel oops:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      PGD 800000000548d067 P4D 800000000548d067 PUD 54a0067 PMD 0
      SMP PTI
      CPU: 0 PID: 359 Comm: ibv_rc_pingpong Not tainted 4.18.0-rc1+ #28
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      RIP: 0010:          (null)
      Code: Bad RIP value.
      RSP: 0018:ffffc900001a3bf0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff88000603bd00 RCX: 0000000000000003
      RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88000603bd00
      RBP: 0000000000000001 R08: ffffc900001a3cf8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900001a3cf0
      R13: 0000000000000000 R14: ffffc900001a3cf0 R15: 0000000000000000
      FS:  00007fb00dda8700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffffffffd6 CR3: 000000000548e004 CR4: 00000000003606b0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? rdma_lookup_put_uobject+0x22/0x50 [ib_uverbs]
       ? uverbs_finalize_object+0x3b/0x60 [ib_uverbs]
       ? uverbs_finalize_attrs+0x128/0x140 [ib_uverbs]
       ? ib_uverbs_cmd_verbs+0x698/0x7c0 [ib_uverbs]
       ? find_held_lock+0x2d/0x90
       ? __might_fault+0x39/0x90
       ? ib_uverbs_ioctl+0x111/0x1f0 [ib_uverbs]
       ? do_vfs_ioctl+0xa0/0x6d0
       ? trace_hardirqs_on_caller+0xed/0x180
       ? _raw_spin_unlock_irq+0x24/0x40
       ? syscall_trace_enter+0x138/0x1d0
       ? ksys_ioctl+0x35/0x60
       ? __x64_sys_ioctl+0x11/0x20
       ? do_syscall_64+0x5b/0x1c0
       ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is because the type was replaced with the null_type during explicit
      destroy that cannot complete the destruction.
      
      One of the side effects of replacing the type is to make the object
      handle totally unreachable - so no other command could attempt to use
      it, even though it remains on the uboject list.
      
      We can get the same end result by just fully destroying the object inside
      rdma_explicit_destroy and leaving the caller the residual kref for the
      uobj with no attached HW object, and no presence in the ubojects list.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      3df593bf
  5. 24 Jul, 2018 4 commits