1. 27 May, 2024 1 commit
    • drm/nouveau/nvif: Avoid build error due to potential integer overflows · 779aa4d7
      Guenter Roeck authored
      Trying to build parisc:allmodconfig with gcc 12.x or later results
      in the following build error.
      
      drivers/gpu/drm/nouveau/nvif/object.c: In function 'nvif_object_mthd':
      drivers/gpu/drm/nouveau/nvif/object.c:161:9: error:
      	'memcpy' accessing 4294967264 or more bytes at offsets 0 and 32 overlaps 6442450881 bytes at offset -2147483617 [-Werror=restrict]
        161 |         memcpy(data, args->mthd.data, size);
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/gpu/drm/nouveau/nvif/object.c: In function 'nvif_object_ctor':
      drivers/gpu/drm/nouveau/nvif/object.c:298:17: error:
      	'memcpy' accessing 4294967240 or more bytes at offsets 0 and 56 overlaps 6442450833 bytes at offset -2147483593 [-Werror=restrict]
        298 |                 memcpy(data, args->new.data, size);
      
      gcc assumes that 'sizeof(*args) + size' can overflow, which would result
      in the problem.
      
      The problem is not new, only it is now no longer a warning but an error
      since W=1 has been enabled for the drm subsystem and since Werror is
      enabled for test builds.
      
      Rearrange arithmetic and use check_add_overflow() for validating the
      allocation size to avoid the overflow. While at it, split assignments
      out of if conditions.
      
      Fixes: a61ddb43 ("drm: enable (most) W=1 warnings by default across the subsystem")
      Cc: Javier Martinez Canillas <javierm@redhat.com>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Danilo Krummrich <dakr@redhat.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Joe Perches <joe@perches.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Danilo Krummrich <dakr@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240524134817.1369993-1-linux@roeck-us.net
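The overflow check described in this commit can be illustrated outside the kernel. The following is a minimal userspace sketch, not the code from the patch: `struct demo_args` and `demo_alloc_size()` are hypothetical names, and the GCC/Clang builtin `__builtin_add_overflow()` stands in for the kernel's `check_add_overflow()` helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a fixed-size argument header. */
struct demo_args {
	uint32_t version;
	uint32_t method;
};

/*
 * Compute sizeof(struct demo_args) + payload without wrapping around.
 * Returns true and stores the sum in *total on success; returns false
 * if the addition would overflow a 32-bit size.
 */
static bool demo_alloc_size(uint32_t payload, uint32_t *total)
{
	uint32_t header = (uint32_t)sizeof(struct demo_args);

	/* The builtin returns true when the addition overflowed. */
	return !__builtin_add_overflow(header, payload, total);
}
```

A caller would reject the request when `demo_alloc_size()` returns false, before any allocation or `memcpy()` takes place, so the compiler can see that the copied size is bounded.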
  2. 24 May, 2024 1 commit
  3. 21 May, 2024 1 commit
  4. 17 May, 2024 2 commits
  5. 13 May, 2024 1 commit
  6. 07 May, 2024 1 commit
  7. 03 May, 2024 2 commits
  8. 02 May, 2024 3 commits
  9. 30 Apr, 2024 3 commits
    • drm/vmwgfx: Fix invalid reads in fence signaled events · a37ef761
      Zack Rusin authored
      Correctly set the length of the drm_event to the size of the structure
      that's actually used.
      
      The length of the drm_event was set to the size of the parent structure
      instead of to that of the drm_vmw_event_fence which is supposed to be
      read. drm_read() uses the length parameter to copy the event to user
      space, thus resulting in out-of-bounds reads.
      Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
      Fixes: 8b7de6aa ("vmwgfx: Rework fence event action")
      Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-23566
      Cc: David Airlie <airlied@gmail.com>
      CC: Daniel Vetter <daniel@ffwll.ch>
      Cc: Zack Rusin <zack.rusin@broadcom.com>
      Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
      Cc: dri-devel@lists.freedesktop.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # v3.4+
      Reviewed-by: Maaz Mombasawala <maaz.mombasawala@broadcom.com>
      Reviewed-by: Martin Krastev <martin.krastev@broadcom.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240425192748.1761522-1-zack.rusin@broadcom.com
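The length mix-up described in this commit is easy to reproduce in miniature. The sketch below uses hypothetical `demo_*` structures that only mimic the shape of an event embedded in a larger bookkeeping structure; it is not the vmwgfx code. The point is that a reader which trusts the `length` field copies exactly that many bytes, so `length` must describe the embedded event and not its container.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical event header: a reader trusts 'length' when copying. */
struct demo_event {
	uint32_t type;
	uint32_t length;
};

/* Hypothetical event that is meant to reach the reader. */
struct demo_event_fence {
	struct demo_event base;
	uint64_t user_data;
	uint32_t tv_sec;
	uint32_t tv_usec;
};

/* Hypothetical container holding private bookkeeping plus the event. */
struct demo_pending {
	void *private_state[4];        /* must never be copied out */
	struct demo_event_fence event;
};

static void demo_init_event(struct demo_pending *p)
{
	memset(p, 0, sizeof(*p));
	p->event.base.type = 1;
	/* Buggy pattern: sizeof(*p). Correct: size of the event itself. */
	p->event.base.length = sizeof(p->event);
}

/* Copies 'length' bytes starting at the event, as a trusting reader would. */
static size_t demo_read(const struct demo_pending *p, void *buf, size_t buflen)
{
	size_t len = p->event.base.length;

	if (len > buflen)
		return 0;
	memcpy(buf, &p->event, len);
	return len;
}
```

With the buggy `sizeof(*p)` the copy would start at the event and run past the end of the container, which is the out-of-bounds read the commit message describes.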
    • drm/nouveau/gsp: Use the sg allocator for level 2 of radix3 · 6f572a80
      Lyude Paul authored
      Currently we allocate all 3 levels of radix3 page tables using
      nvkm_gsp_mem_ctor(), which uses dma_alloc_coherent() for allocating all of
      the relevant memory. This can end up failing in scenarios where the system
      has very high memory fragmentation, and we can't find enough contiguous
      memory to allocate level 2 of the page table.
      
      Currently, this can result in runtime PM issues on systems where memory
      fragmentation is high - as we'll fail to allocate the page table for our
      suspend/resume buffer:
      
        kworker/10:2: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
        nodemask=(null),cpuset=/,mems_allowed=0
        CPU: 10 PID: 479809 Comm: kworker/10:2 Not tainted
        6.8.6-201.ChopperV6.fc39.x86_64 #1
        Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
        Workqueue: pm pm_runtime_work
        Call Trace:
         <TASK>
         dump_stack_lvl+0x64/0x80
         warn_alloc+0x165/0x1e0
         ? __alloc_pages_direct_compact+0xb3/0x2b0
         __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
         __alloc_pages+0x32d/0x350
         __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
         dma_direct_alloc+0x70/0x270
         nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
         r535_gsp_fini+0x1d4/0x350 [nouveau]
         nvkm_subdev_fini+0x67/0x150 [nouveau]
         nvkm_device_fini+0x95/0x1e0 [nouveau]
         nvkm_udevice_fini+0x53/0x70 [nouveau]
         nvkm_object_fini+0xb9/0x240 [nouveau]
         nvkm_object_fini+0x75/0x240 [nouveau]
         nouveau_do_suspend+0xf5/0x280 [nouveau]
         nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
         pci_pm_runtime_suspend+0x67/0x1e0
         ? __pfx_pci_pm_runtime_suspend+0x10/0x10
         __rpm_callback+0x41/0x170
         ? __pfx_pci_pm_runtime_suspend+0x10/0x10
         rpm_callback+0x5d/0x70
         ? __pfx_pci_pm_runtime_suspend+0x10/0x10
         rpm_suspend+0x120/0x6a0
         pm_runtime_work+0x98/0xb0
         process_one_work+0x171/0x340
         worker_thread+0x27b/0x3a0
         ? __pfx_worker_thread+0x10/0x10
         kthread+0xe5/0x120
         ? __pfx_kthread+0x10/0x10
         ret_from_fork+0x31/0x50
         ? __pfx_kthread+0x10/0x10
         ret_from_fork_asm+0x1b/0x30
      
      Luckily, we don't actually need to allocate coherent memory for the page
      table thanks to being able to pass the GPU a radix3 page table for
      suspend/resume data. So, let's rewrite nvkm_gsp_radix3_sg() to use the sg
      allocator for level 2. We continue using coherent allocations for lvl0 and
      1, since they only take a single page.
      
      V2:
      * Don't forget to actually jump to the next scatterlist when we reach the
        end of the scatterlist we're currently on when writing out the page table
        for level 2
      Signed-off-by: Lyude Paul <lyude@redhat.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Ben Skeggs <bskeggs@nvidia.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240429182318.189668-2-lyude@redhat.com
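For readers unfamiliar with the `order:7` in the allocation failure above: an order-N request asks the page allocator for 2^N physically contiguous pages. The sketch below works through that arithmetic under the assumption of 4 KiB pages (as on x86_64); `demo_get_order()` and `demo_order_bytes()` are hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/*
 * Smallest order such that (DEMO_PAGE_SIZE << order) >= size.
 * Assumes size is at most SIZE_MAX / 2 so the shift cannot wrap.
 */
static unsigned int demo_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = DEMO_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Bytes of contiguous memory that an order-N request needs. */
static size_t demo_order_bytes(unsigned int order)
{
	return (size_t)DEMO_PAGE_SIZE << order;
}
```

Under that assumption an order-7 request needs 128 contiguous pages, or 512 KiB, which is hard to find on a fragmented system; a scatterlist-based allocation can instead be assembled from smaller, non-contiguous chunks.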
    • drm/nouveau/firmware: Fix SG_DEBUG error with nvkm_firmware_ctor() · 52a6947b
      Lyude Paul authored
      Currently, enabling SG_DEBUG in the kernel will cause nouveau to hit a
      BUG() on startup:
      
        kernel BUG at include/linux/scatterlist.h:187!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 7 PID: 930 Comm: (udev-worker) Not tainted 6.9.0-rc3Lyude-Test+ #30
        Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.I0 01/22/2019
        RIP: 0010:sg_init_one+0x85/0xa0
        Code: 69 88 32 01 83 e1 03 f6 c3 03 75 20 a8 01 75 1e 48 09 cb 41 89 54
        24 08 49 89 1c 24 41 89 6c 24 0c 5b 5d 41 5c e9 7b b9 88 00 <0f> 0b 0f 0b
        0f 0b 48 8b 05 5e 46 9a 01 eb b2 66 66 2e 0f 1f 84 00
        RSP: 0018:ffffa776017bf6a0 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffffa77600d87000 RCX: 000000000000002b
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffa77680d87000
        RBP: 000000000000e000 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff98f4c46aa508 R11: 0000000000000000 R12: ffff98f4c46aa508
        R13: ffff98f4c46aa008 R14: ffffa77600d4a000 R15: ffffa77600d4a018
        FS:  00007feeb5aae980(0000) GS:ffff98f5c4dc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f22cb9a4520 CR3: 00000001043ba000 CR4: 00000000003506f0
        Call Trace:
         <TASK>
         ? die+0x36/0x90
         ? do_trap+0xdd/0x100
         ? sg_init_one+0x85/0xa0
         ? do_error_trap+0x65/0x80
         ? sg_init_one+0x85/0xa0
         ? exc_invalid_op+0x50/0x70
         ? sg_init_one+0x85/0xa0
         ? asm_exc_invalid_op+0x1a/0x20
         ? sg_init_one+0x85/0xa0
         nvkm_firmware_ctor+0x14a/0x250 [nouveau]
         nvkm_falcon_fw_ctor+0x42/0x70 [nouveau]
         ga102_gsp_booter_ctor+0xb4/0x1a0 [nouveau]
         r535_gsp_oneinit+0xb3/0x15f0 [nouveau]
         ? srso_return_thunk+0x5/0x5f
         ? srso_return_thunk+0x5/0x5f
         ? nvkm_udevice_new+0x95/0x140 [nouveau]
         ? srso_return_thunk+0x5/0x5f
         ? srso_return_thunk+0x5/0x5f
         ? ktime_get+0x47/0xb0
         ? srso_return_thunk+0x5/0x5f
         nvkm_subdev_oneinit_+0x4f/0x120 [nouveau]
         nvkm_subdev_init_+0x39/0x140 [nouveau]
         ? srso_return_thunk+0x5/0x5f
         nvkm_subdev_init+0x44/0x90 [nouveau]
         nvkm_device_init+0x166/0x2e0 [nouveau]
         nvkm_udevice_init+0x47/0x70 [nouveau]
         nvkm_object_init+0x41/0x1c0 [nouveau]
         nvkm_ioctl_new+0x16a/0x290 [nouveau]
         ? __pfx_nvkm_client_child_new+0x10/0x10 [nouveau]
         ? __pfx_nvkm_udevice_new+0x10/0x10 [nouveau]
         nvkm_ioctl+0x126/0x290 [nouveau]
         nvif_object_ctor+0x112/0x190 [nouveau]
         nvif_device_ctor+0x23/0x60 [nouveau]
         nouveau_cli_init+0x164/0x640 [nouveau]
         nouveau_drm_device_init+0x97/0x9e0 [nouveau]
         ? srso_return_thunk+0x5/0x5f
         ? pci_update_current_state+0x72/0xb0
         ? srso_return_thunk+0x5/0x5f
         nouveau_drm_probe+0x12c/0x280 [nouveau]
         ? srso_return_thunk+0x5/0x5f
         local_pci_probe+0x45/0xa0
         pci_device_probe+0xc7/0x270
         really_probe+0xe6/0x3a0
         __driver_probe_device+0x87/0x160
         driver_probe_device+0x1f/0xc0
         __driver_attach+0xec/0x1f0
         ? __pfx___driver_attach+0x10/0x10
         bus_for_each_dev+0x88/0xd0
         bus_add_driver+0x116/0x220
         driver_register+0x59/0x100
         ? __pfx_nouveau_drm_init+0x10/0x10 [nouveau]
         do_one_initcall+0x5b/0x320
         do_init_module+0x60/0x250
         init_module_from_file+0x86/0xc0
         idempotent_init_module+0x120/0x2b0
         __x64_sys_finit_module+0x5e/0xb0
         do_syscall_64+0x83/0x160
         ? srso_return_thunk+0x5/0x5f
         entry_SYSCALL_64_after_hwframe+0x71/0x79
        RIP: 0033:0x7feeb5cc20cd
        Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89
        f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0
        ff ff 73 01 c3 48 8b 0d 1b cd 0c 00 f7 d8 64 89 01 48
        RSP: 002b:00007ffcf220b2c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
        RAX: ffffffffffffffda RBX: 000055fdd2916aa0 RCX: 00007feeb5cc20cd
        RDX: 0000000000000000 RSI: 000055fdd29161e0 RDI: 0000000000000035
        RBP: 00007ffcf220b380 R08: 00007feeb5d8fb20 R09: 00007ffcf220b310
        R10: 000055fdd2909dc0 R11: 0000000000000246 R12: 000055fdd29161e0
        R13: 0000000000020000 R14: 000055fdd29203e0 R15: 000055fdd2909d80
         </TASK>
      
      We hit this when trying to initialize firmware of type
      NVKM_FIRMWARE_IMG_DMA because we allocate our memory with
      dma_alloc_coherent, and DMA allocations can't be turned back into memory
      pages - which a scatterlist needs in order to map them.
      
      So, fix this by allocating the memory with vmalloc() instead.
      
      V2:
      * Fixup explanation as the prior one was bogus
      Signed-off-by: Lyude Paul <lyude@redhat.com>
      Reviewed-by: Dave Airlie <airlied@redhat.com>
      Cc: stable@vger.kernel.org
      Link: https://patchwork.freedesktop.org/patch/msgid/20240429182318.189668-1-lyude@redhat.com
  10. 29 Apr, 2024 1 commit
  11. 26 Apr, 2024 2 commits
  12. 24 Apr, 2024 1 commit
  13. 19 Apr, 2024 1 commit
  14. 18 Apr, 2024 1 commit
  15. 16 Apr, 2024 2 commits
  16. 15 Apr, 2024 9 commits
    • drm/nouveau/dp: Don't probe eDP ports twice harder · bf52d7f9
      Lyude Paul authored
      I didn't pay close enough attention the last time I tried to fix this
      problem - while we currently do correctly take care to make sure we don't
      probe a connected eDP port more than once, we don't do the same thing for
      eDP ports we found to be disconnected.
      
      So, fix this and make sure we only ever probe eDP ports once and then leave
      them at that connector state forever (since without HPD, it's not going to
      change on its own anyway). This should get rid of the last few GSP errors
      getting spit out during runtime suspend and resume on some machines, as we
      tried to reprobe eDP ports in response to ACPI hotplug probe events.
      Signed-off-by: Lyude Paul <lyude@redhat.com>
      Reviewed-by: Dave Airlie <airlied@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240404233736.7946-3-lyude@redhat.com
      (cherry picked from commit fe6660b6)
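The "probe once, then remember the result" behaviour described in this commit can be sketched as a small cache in front of the hardware probe. Everything below is hypothetical (`struct demo_connector`, `demo_hw_probe()`, `demo_detect()`) and only illustrates the idea; it is not the nouveau implementation.

```c
#include <stdbool.h>

enum demo_status { DEMO_UNKNOWN, DEMO_CONNECTED, DEMO_DISCONNECTED };

/* Hypothetical connector model. */
struct demo_connector {
	bool is_edp;
	enum demo_status hw_state;  /* what a real probe would report */
	enum demo_status cached;    /* DEMO_UNKNOWN until the first probe */
	unsigned int probes;        /* number of hardware probes issued */
};

/* Stand-in for an expensive hardware probe. */
static enum demo_status demo_hw_probe(struct demo_connector *c)
{
	c->probes++;
	return c->hw_state;
}

/*
 * eDP panels have no hotplug detection, so the first result is kept,
 * whether it was "connected" or "disconnected". Other connector types
 * are probed every time.
 */
static enum demo_status demo_detect(struct demo_connector *c)
{
	enum demo_status status;

	if (c->is_edp && c->cached != DEMO_UNKNOWN)
		return c->cached;

	status = demo_hw_probe(c);
	if (c->is_edp)
		c->cached = status;
	return status;
}
```

The detail that matters is that the cache check does not care which state was recorded: a disconnected eDP port is remembered just like a connected one.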
    • drm/nouveau/kms/nv50-: Disable AUX bus for disconnected DP ports · ee7e980d
      Lyude Paul authored
      GSP has its own state for keeping track of whether or not a given display
      connector is plugged in, and enforces this state on the driver. In
      particular, AUX transactions on a DisplayPort connector which GSP says is
      disconnected can never succeed - and can in some cases even cause
      unexpected timeouts, which can trickle up to cause other problems. A good
      example of this is runtime power management: where we can actually get
      stuck trying to resume the GPU if a userspace application like fwupd tries
      accessing a drm_aux_dev for a disconnected port. This was an issue I hit a
      few times with my Slimbook Executive 16 - where trying to offload something
      to the discrete GPU would wake it up, and then potentially cause it to
      timeout as fwupd tried to immediately access the dp_aux_dev nodes for
      nouveau.
      
      Likewise: we don't really have any cases I know of where we'd want to
      ignore this state and try an aux transaction anyway - and failing pointless
      aux transactions immediately can even speed things up. So - let's start
      enabling/disabling the aux bus in nouveau_dp_detect() to fix this. We
      enable the aux bus during connector probing, and leave it enabled if we
      discover something is actually on the connector. Otherwise, we just shut it
      off.
      
      This should fix some people's runtime PM issues (mine included), and also
      get rid of quite a lot of GSP error spam in dmesg.
      Signed-off-by: Lyude Paul <lyude@redhat.com>
      Reviewed-by: Dave Airlie <airlied@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240404233736.7946-2-lyude@redhat.com
      (cherry picked from commit 9c8a10bf)
    • drm/v3d: Don't increment `enabled_ns` twice · 35f4f8c9
      Maíra Canal authored
      The commit 509433d8 ("drm/v3d: Expose the total GPU usage stats on sysfs")
      introduced the calculation of global GPU stats. For that purpose, it used
      the already existing infrastructure provided by commit 09a93cc4 ("drm/v3d:
      Implement show_fdinfo() callback for GPU usage stats"). While adding
      global GPU stats calculation ability, the author forgot to delete the
      existing one.
      
      Currently, the value of `enabled_ns` is incremented twice by the end of
      the job, when it should be added just once. Therefore, delete the
      leftovers from commit 509433d8 ("drm/v3d: Expose the total GPU usage
      stats on sysfs").
      
      Fixes: 509433d8 ("drm/v3d: Expose the total GPU usage stats on sysfs")
      Reported-by: Tvrtko Ursulin <tursulin@igalia.com>
      Signed-off-by: Maíra Canal <mcanal@igalia.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
      Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240403203517.731876-2-mcanal@igalia.com
    • drm/vmwgfx: Sort primary plane formats by order of preference · d4c972bf
      Zack Rusin authored
      The table of primary plane formats wasn't sorted at all, leading to
      applications picking our least desirable formats by default.
      
      Sort the primary plane formats according to our order of preference.
      
      A nice side effect of this change is that it makes IGT's kms_atomic
      plane-invalid-params pass. The test picks the first format, which for
      vmwgfx was DRM_FORMAT_XRGB1555, and uses fbs with odd sizes. Those make
      Pixman, which IGT depends on, assert because our 16bpp formats aren't
      32-bit aligned as Pixman requires all formats to be.
      Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
      Fixes: 36cc79bc ("drm/vmwgfx: Add universal plane support")
      Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
      Cc: dri-devel@lists.freedesktop.org
      Cc: <stable@vger.kernel.org> # v4.12+
      Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240412025511.78553-6-zack.rusin@broadcom.com
    • drm/vmwgfx: Fix crtc's atomic check conditional · a60ccade
      Zack Rusin authored
      The conditional was supposed to prevent enabling of a crtc state
      without a set primary plane. Accidentally it also prevented disabling
      crtc state with a set primary plane. Neither is correct.
      
      Fix the conditional and just driver-warn when a crtc state has been
      enabled without a primary plane which will help debug broken userspace.
      
      Fixes IGT's kms_atomic_interruptible and kms_atomic_transition tests.
      Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
      Fixes: 06ec4190 ("drm/vmwgfx: Add and connect CRTC helper functions")
      Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
      Cc: dri-devel@lists.freedesktop.org
      Cc: <stable@vger.kernel.org> # v4.12+
      Reviewed-by: Ian Forbes <ian.forbes@broadcom.com>
      Reviewed-by: Martin Krastev <martin.krastev@broadcom.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240412025511.78553-5-zack.rusin@broadcom.com
    • drm/vmwgfx: Fix prime import/export · b32233ac
      Zack Rusin authored
      vmwgfx never supported prime import of external buffers. Furthermore the
      driver exposes two different objects to userspace, vmw_surfaces and
      gem buffers, but prime import/export only worked with vmw_surfaces.
      
      Because gem buffers are used through the dumb_buffer interface, this meant
      that driver-created buffers couldn't be prime exported or imported.
      
      Fix prime import/export. Makes IGT's kms_prime pass.
      Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
      Fixes: 8afa13a0 ("drm/vmwgfx: Implement DRIVER_GEM")
      Cc: <stable@vger.kernel.org> # v6.6+
      Reviewed-by: Martin Krastev <martin.krastev@broadcom.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240412025511.78553-4-zack.rusin@broadcom.com
    • drm/ttm: stop pooling cached NUMA pages v2 · b6976f32
      Christian König authored
      We only pool write combined and uncached allocations because they
      require extra overhead on allocation and release.
      
      If we also pool cached NUMA pages, it not only means some extra
      unnecessary overhead, but also that under memory pressure pages from
      the wrong NUMA node can enter the pool and be re-used over and over
      again.
      
      This can lead to performance reduction after running into memory
      pressure.
      
      v2: restructure and cleanup the code a bit from the internal hack to
          test this.
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Fixes: 4482d3c9 ("drm/ttm: add NUMA node id to the pool")
      CC: stable@vger.kernel.org
      Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240415134821.1919-1-christian.koenig@amd.com
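The pooling rule stated in this commit can be reduced to a one-line predicate. The sketch below is a deliberately simplified illustration with hypothetical names (`enum demo_caching`, `demo_should_pool()`); the real TTM pool takes more inputs into account than the caching mode alone.

```c
#include <stdbool.h>

/* Hypothetical caching modes, mirroring the three the commit names. */
enum demo_caching { DEMO_UNCACHED, DEMO_WRITE_COMBINED, DEMO_CACHED };

/*
 * Decide whether a freed page is worth keeping in a pool.
 * Assumption: only pages whose caching attribute is expensive to set up
 * and tear down (write-combined, uncached) benefit from pooling. Cached
 * pages go straight back to the system, so the next allocation can be
 * served from the correct NUMA node.
 */
static bool demo_should_pool(enum demo_caching caching)
{
	return caching != DEMO_CACHED;
}
```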
    • drm: nv04: Fix out of bounds access · cf92bb77
      Mikhail Kobuk authored
      When the Output Resource (dcb->or) value is assigned in
      fabricate_dcb_output(), there may be an out-of-bounds access to the
      dac_users array in case dcb->or is zero, because ffs(dcb->or) is
      used as the index there.
      The 'or' argument of fabricate_dcb_output() must be interpreted as the
      number of the bit to set, not as a value.
      
      Utilize macros from 'enum nouveau_or' in calls instead of hardcoding.
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      Fixes: 2e5702af ("drm/nouveau: fabricate DCB encoder table for iMac G4")
      Fixes: 670820c0 ("drm/nouveau: Workaround incorrect DCB entry on a GeForce3 Ti 200.")
      Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru>
      Signed-off-by: Danilo Krummrich <dakr@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240411110854.16701-1-m.kobuk@ispras.ru
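The hazard with `ffs()` on a zero mask can be shown in a few lines of userspace C. This sketch assumes the usual `ffs(mask) - 1` conversion from bitmask to array index; `demo_or_to_index()`, `demo_mark_dac_user()` and `DEMO_DAC_COUNT` are hypothetical and do not come from the nouveau sources.

```c
#include <strings.h>  /* ffs() (POSIX) */

#define DEMO_DAC_COUNT 4  /* hypothetical size of the users array */

/*
 * Convert a one-bit mask into an array index.
 * ffs() returns 0 when no bit is set, so a zero mask yields -1.
 */
static int demo_or_to_index(int or_mask)
{
	return ffs(or_mask) - 1;
}

/*
 * Guarded update: returns 0 on success, -1 if the mask does not map to
 * a valid slot. The caller must pass bit < 32.
 */
static int demo_mark_dac_user(unsigned int users[DEMO_DAC_COUNT],
			      int or_mask, unsigned int bit)
{
	int idx = demo_or_to_index(or_mask);

	if (idx < 0 || idx >= DEMO_DAC_COUNT)
		return -1;
	users[idx] |= 1u << bit;
	return 0;
}
```

Without the range check, a zero mask would index one element before the start of the array, which matches the out-of-bounds access the commit message describes.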
    • nouveau: fix instmem race condition around ptr stores · fff1386c
      Dave Airlie authored
      Running a lot of VK CTS in parallel against nouveau, once every
      few hours you might see something like this crash.
      
      BUG: kernel NULL pointer dereference, address: 0000000000000008
      PGD 8000000114e6e067 P4D 8000000114e6e067 PUD 109046067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 7 PID: 53891 Comm: deqp-vk Not tainted 6.8.0-rc6+ #27
      Hardware name: Gigabyte Technology Co., Ltd. Z390 I AORUS PRO WIFI/Z390 I AORUS PRO WIFI-CF, BIOS F8 11/05/2021
      RIP: 0010:gp100_vmm_pgt_mem+0xe3/0x180 [nouveau]
      Code: c7 48 01 c8 49 89 45 58 85 d2 0f 84 95 00 00 00 41 0f b7 46 12 49 8b 7e 08 89 da 42 8d 2c f8 48 8b 47 08 41 83 c7 01 48 89 ee <48> 8b 40 08 ff d0 0f 1f 00 49 8b 7e 08 48 89 d9 48 8d 75 04 48 c1
      RSP: 0000:ffffac20c5857838 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: 00000000004d8001 RCX: 0000000000000001
      RDX: 00000000004d8001 RSI: 00000000000006d8 RDI: ffffa07afe332180
      RBP: 00000000000006d8 R08: ffffac20c5857ad0 R09: 0000000000ffff10
      R10: 0000000000000001 R11: ffffa07af27e2de0 R12: 000000000000001c
      R13: ffffac20c5857ad0 R14: ffffa07a96fe9040 R15: 000000000000001c
      FS:  00007fe395eed7c0(0000) GS:ffffa07e2c980000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000011febe001 CR4: 00000000003706f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      
      ...
      
       ? gp100_vmm_pgt_mem+0xe3/0x180 [nouveau]
       ? gp100_vmm_pgt_mem+0x37/0x180 [nouveau]
       nvkm_vmm_iter+0x351/0xa20 [nouveau]
       ? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
       ? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
       ? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
       ? __lock_acquire+0x3ed/0x2170
       ? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
       nvkm_vmm_ptes_get_map+0xc2/0x100 [nouveau]
       ? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
       ? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
       nvkm_vmm_map_locked+0x224/0x3a0 [nouveau]
      
      Adding any sort of useful debug usually makes it go away, so I hand
      wrote the function in a line, and debugged the asm.
      
      Every so often pt->memory->ptrs is NULL. This ptrs ptr is set in
      the nv50_instobj_acquire called from nvkm_kmap.
      
      If Thread A and Thread B both get to nv50_instobj_acquire around
      the same time, and Thread A hits the refcount_set line, and in
      lockstep thread B succeeds at refcount_inc_not_zero, there is a
      chance the ptrs value won't have been stored since refcount_set
      is unordered. Force a memory barrier here, I picked smp_mb, since
      we want it on all CPUs and it's write followed by a read.
      
      v2: use paired smp_rmb/smp_wmb.
      
      Cc: <stable@vger.kernel.org>
      Fixes: be55287a ("drm/nouveau/imem/nv50: embed nvkm_instobj directly into nv04_instobj")
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Danilo Krummrich <dakr@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240411011510.2546857-1-airlied@gmail.com
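The ordering problem in this commit is the classic "publish a pointer, then raise a flag" pattern. The sketch below is a userspace analogue using C11 release/acquire atomics in place of the kernel's paired smp_wmb()/smp_rmb(); it is not the nouveau code, and `struct demo_obj`, `demo_publish()` and `demo_try_acquire()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object whose mapping pointer is guarded by a refcount. */
struct demo_obj {
	void *map;          /* plain pointer, published through 'refs' */
	atomic_uint refs;   /* 0 means "not mapped yet" */
};

/*
 * First user: store the pointer, then publish it. The release store
 * guarantees the pointer write is visible before refs becomes non-zero.
 */
static void demo_publish(struct demo_obj *o, void *map)
{
	o->map = map;
	atomic_store_explicit(&o->refs, 1, memory_order_release);
}

/*
 * Later users: increment the count only if it is non-zero. The acquire
 * ordering on success pairs with the release store above, so reading
 * o->map afterwards cannot observe a stale NULL.
 */
static void *demo_try_acquire(struct demo_obj *o)
{
	unsigned int old = atomic_load_explicit(&o->refs, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(&o->refs, &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return o->map;
		/* 'old' was refreshed by the failed exchange; retry. */
	}
	return NULL;
}
```

If both operations were relaxed, a second thread could see the non-zero count and still read the old pointer value, which is the NULL dereference seen in the oops above.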
  17. 09 Apr, 2024 1 commit
  18. 08 Apr, 2024 7 commits