1. 29 Nov, 2022 6 commits
    • drm/amdgpu: skip vram reserve on firmware_v2_2 for bare-metal · c0924ad7
      Likun Gao authored
      vram_usagebyfirmware v2_2 is only used in the SRIOV case, so skip the
      related settings in the bare-metal case for now.
      Signed-off-by: Likun Gao <Likun.Gao@amd.com>
      Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: add printing to indicate rpm completeness · f4b09c29
      Guchun Chen authored
      Add an explicit print to indicate when rpm execution has finished
      in amdgpu.
      Signed-off-by: Guchun Chen <guchun.chen@amd.com>
      Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
      Reviewed-by: Evan Quan <evan.quan@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amd/pm/smu11: poll BACO status after RPM BACO exits · 86a3c691
      Guchun Chen authored
      After executing BACO exit, the driver needs to poll the status
      to ensure the FW has completed the BACO exit sequence, which
      prevents a timing issue.
      
      v2: use usleep_range to replace msleep to fix checkpatch.pl warnings
      Signed-off-by: Guchun Chen <guchun.chen@amd.com>
      Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
      Reviewed-by: Evan Quan <evan.quan@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
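The poll-after-exit idea in commit 86a3c691 can be sketched in plain userspace C. This is a minimal illustration, not the actual patch: `wait_for_baco_exit`, `enum baco_state`, the callback types, and the `fake_fw` helper are all hypothetical names, and the sleep callback merely stands in for the kernel's `usleep_range()` mentioned in the v2 note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values; a real driver would read them from hardware. */
enum baco_state { BACO_STATE_IN = 0, BACO_STATE_OUT = 1 };

/* Callbacks stand in for the status read and for usleep_range(). */
typedef enum baco_state (*read_state_fn)(void *ctx);
typedef void (*sleep_fn)(void *ctx);

/*
 * Poll until the firmware reports that BACO exit has completed.
 * Returns 0 on success, or -1 if all max_polls reads still showed
 * the device in BACO.
 */
static int wait_for_baco_exit(read_state_fn read_state, sleep_fn sleep_between,
                              void *ctx, unsigned int max_polls)
{
    assert(read_state != NULL);

    for (unsigned int i = 0; i < max_polls; i++) {
        if (read_state(ctx) == BACO_STATE_OUT)
            return 0;
        if (sleep_between != NULL)
            sleep_between(ctx); /* back off before the next status read */
    }
    return -1;
}

/* Fake firmware for demonstration: reports "out" after a fixed number of reads. */
struct fake_fw {
    unsigned int reads_until_out;
    unsigned int reads;
    unsigned int sleeps;
};

static enum baco_state fake_read_state(void *ctx)
{
    struct fake_fw *fw = ctx;

    fw->reads++;
    return fw->reads >= fw->reads_until_out ? BACO_STATE_OUT : BACO_STATE_IN;
}

static void fake_sleep(void *ctx)
{
    struct fake_fw *fw = ctx;

    fw->sleeps++;
}
```

Bounding the loop with `max_polls` is a design choice of this sketch: it lets the caller report a timeout instead of spinning forever if the firmware never leaves BACO.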
    • drm/amd/pm/smu11: BACO is supported when it's in BACO state · 6dca7efe
      Guchun Chen authored
      Return true early if the ASIC is already in BACO state; there is
      no need to talk to the SMU. This fixes an issue where the driver
      was not calling BACO exit at all in runtime pm resume, and a
      timing issue eventually led to a PCI AER error.
      
      Fixes: 8795e182 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
      Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
      Signed-off-by: Guchun Chen <guchun.chen@amd.com>
      Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
      Reviewed-by: Evan Quan <evan.quan@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: add drv_vram_usage_va for virt data exchange · 6d96ced7
      Tong Liu01 authored
      For vram_usagebyfirmware_v2_2, fw_vram_reserve is not used, so
      fw_vram_usage_va is NULL and can no longer be used for virt data
      exchange. Add drv_vram_usage_va to do virt data exchange in the
      vram_usagebyfirmware_v2_2 case. Also refine some code style issues
      from the previous patch that added the vram reservation logic.
      Signed-off-by: Tong Liu01 <Tong.Liu01@amd.com>
      Acked-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: fix stall on CPU when allocate large system memory · c1420a5d
      James Zhu authored
      -v2: 1. rename variable to reduce confusion
           2. optimize the code
      -v3: move new define out of the middle of the code
      -v4: squash in minmax error fix (Luben)
      
      When applications try to allocate large system memory (more than 128GB),
      a CPU stall ("rcu_sched self-detected stall on CPU") is reported.
      
      For such large system memory, walk_page_range usually takes more than 20s.
      The warning can be avoided by splitting the hmm range into smaller ranges
      of no more than 64GB each, one per walk_page_range call.
      
      [  164.437617] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1753: amdgpu: 	create BO VA 0x7f63c7a00000 size 0x2f16000000 domain CPU
      [  164.488847] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1785: amdgpu: creating userptr BO for user_addr = 7f63c7a00000
      [  185.439116] rcu: INFO: rcu_sched self-detected stall on CPU
      [  185.439125] rcu: 	8-....: (20999 ticks this GP) idle=e22/1/0x4000000000000000 softirq=2242/2242 fqs=5249
      [  185.439137] 	(t=21000 jiffies g=6325 q=1215)
      [  185.439141] NMI backtrace for cpu 8
      [  185.439143] CPU: 8 PID: 3470 Comm: kfdtest Kdump: loaded Tainted: G           O      5.12.0-0_fbk5_zion_rc1_5697_g2c723fb88626 #1
      [  185.439147] Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant XL675d Gen10 Plus, BIOS A47 11/06/2020
      [  185.439150] Call Trace:
      [  185.439153]  <IRQ>
      [  185.439157]  dump_stack+0x64/0x7c
      [  185.439163]  nmi_cpu_backtrace.cold.7+0x30/0x65
      [  185.439165]  ? lapic_can_unplug_cpu+0x70/0x70
      [  185.439170]  nmi_trigger_cpumask_backtrace+0xf9/0x100
      [  185.439174]  rcu_dump_cpu_stacks+0xc5/0xf5
      [  185.439178]  rcu_sched_clock_irq.cold.97+0x112/0x38c
      [  185.439182]  ? tick_sched_handle.isra.21+0x50/0x50
      [  185.439185]  update_process_times+0x8c/0xc0
      [  185.439189]  tick_sched_timer+0x63/0x70
      [  185.439192]  __hrtimer_run_queues+0xff/0x250
      [  185.439195]  hrtimer_interrupt+0xf4/0x200
      [  185.439199]  __sysvec_apic_timer_interrupt+0x51/0xd0
      [  185.439201]  sysvec_apic_timer_interrupt+0x69/0x90
      [  185.439206]  </IRQ>
      [  185.439207]  asm_sysvec_apic_timer_interrupt+0x12/0x20
      [  185.439211] RIP: 0010:clear_page_rep+0x7/0x10
      [  185.439214] Code: e8 fe 7c 51 00 44 89 e2 48 89 ee 48 89 df e8 60 ff ff ff c6 03 00 5b 5d 41 5c c3 cc cc cc cc cc cc cc cc b9 00 02 00 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00
      [  185.439218] RSP: 0018:ffffc9000f58f818 EFLAGS: 00000246
      [  185.439220] RAX: 0000000000000000 RBX: 0000000000000881 RCX: 000000000000005c
      [  185.439223] RDX: 0000000000100dca RSI: 0000000000000000 RDI: ffff88a59e0e5d20
      [  185.439225] RBP: ffffea0096783940 R08: ffff888118c35280 R09: ffffea0096783940
      [  185.439227] R10: ffff888000000000 R11: 0000160000000000 R12: ffffea0096783980
      [  185.439228] R13: ffffea0096783940 R14: ffff88b07fdfdd00 R15: 0000000000000000
      [  185.439232]  prep_new_page+0x81/0xc0
      [  185.439236]  get_page_from_freelist+0x13be/0x16f0
      [  185.439240]  ? release_pages+0x16a/0x4a0
      [  185.439244]  __alloc_pages_nodemask+0x1ae/0x340
      [  185.439247]  alloc_pages_vma+0x74/0x1e0
      [  185.439251]  __handle_mm_fault+0xafe/0x1360
      [  185.439255]  handle_mm_fault+0xc3/0x280
      [  185.439257]  hmm_vma_fault.isra.22+0x49/0x90
      [  185.439261]  __walk_page_range+0x692/0x9b0
      [  185.439265]  walk_page_range+0x9b/0x120
      [  185.439269]  hmm_range_fault+0x4f/0x90
      [  185.439274]  amdgpu_hmm_range_get_pages+0x24f/0x260 [amdgpu]
      [  185.439463]  amdgpu_ttm_tt_get_user_pages+0xc2/0x190 [amdgpu]
      [  185.439603]  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x49f/0x7a0 [amdgpu]
      [  185.439774]  kfd_ioctl_alloc_memory_of_gpu+0xfb/0x410 [amdgpu]
      Signed-off-by: James Zhu <James.Zhu@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
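The range-splitting approach described in commit c1420a5d can be illustrated with a small userspace C sketch. It is not the kernel change itself: `walk_in_chunks`, `MAX_WALK_BYTES`, `walk_fn`, and `count_walk` are hypothetical names, and the callback stands in for one walk_page_range pass over a sub-range. The 64GB cap is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64GB cap per walk, as stated in the commit message. */
#define MAX_WALK_BYTES (64ULL << 30)

/* Stand-in for one page-table walk over [start, end); returns 0 on success. */
typedef int (*walk_fn)(uint64_t start, uint64_t end, void *ctx);

/*
 * Walk [start, end) in consecutive sub-ranges of at most chunk_max bytes.
 * Stops and returns the first non-zero result from walk().
 */
static int walk_in_chunks(uint64_t start, uint64_t end, uint64_t chunk_max,
                          walk_fn walk, void *ctx)
{
    assert(chunk_max > 0 && walk != NULL);

    while (start < end) {
        uint64_t remaining = end - start;
        uint64_t len = remaining < chunk_max ? remaining : chunk_max;
        int ret = walk(start, start + len, ctx);

        if (ret)
            return ret;
        start += len;
    }
    return 0;
}

/* Demo callback: records how many walks ran and how large they were. */
struct walk_stats {
    unsigned int calls;
    uint64_t bytes;
    uint64_t largest;
};

static int count_walk(uint64_t start, uint64_t end, void *ctx)
{
    struct walk_stats *s = ctx;
    uint64_t len = end - start;

    s->calls++;
    s->bytes += len;
    if (len > s->largest)
        s->largest = len;
    return 0;
}
```

Applied to the buffer from the log above (address 0x7f63c7a00000, size 0x2f16000000, roughly 188GB), this sketch issues three walks: two of 64GB and one for the remainder. In the kernel, each shorter walk returns sooner, which is what keeps any single walk_page_range call from running long enough to trigger the stall warning.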
  2. 23 Nov, 2022 34 commits