1. 29 Nov, 2022 3 commits
      drm/amd/pm/smu11: BACO is supported when it's in BACO state · 6dca7efe
      Guchun Chen authored
      Return true early if the ASIC is already in BACO state; there is no
      need to talk to the SMU. This fixes an issue where the driver did not
      call BACO exit at all during runtime PM resume, after which a timing
      issue eventually caused a PCI AER error.
      
      Fixes: 8795e182 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
      Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
      Signed-off-by: Guchun Chen <guchun.chen@amd.com>
      Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
      Reviewed-by: Evan Quan <evan.quan@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
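The early return described in the commit above can be sketched as follows. This is a minimal illustration, not the driver's code: `struct smu_ctx`, `query_smu_baco_feature()` and `baco_is_supported()` are hypothetical stand-ins, and the `smu_queries` counter exists only to show that the SMU is not contacted when the cached state already says BACO is entered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the real amdgpu structs. */
enum baco_state { BACO_STATE_EXIT = 0, BACO_STATE_ENTER };

struct smu_ctx {
    enum baco_state state;   /* cached BACO state tracked by the driver */
    bool platform_support;   /* whether the platform advertises BACO at all */
    bool smu_feature_on;     /* what the SMU would report if queried */
    unsigned smu_queries;    /* counts how often the SMU was asked */
};

/* Stand-in for a message exchange with the SMU. */
static bool query_smu_baco_feature(struct smu_ctx *smu)
{
    smu->smu_queries++;
    return smu->smu_feature_on;
}

static bool baco_is_supported(struct smu_ctx *smu)
{
    assert(smu != NULL);

    if (!smu->platform_support)
        return false;

    /* Already in BACO: it is evidently supported, so skip the SMU query. */
    if (smu->state == BACO_STATE_ENTER)
        return true;

    return query_smu_baco_feature(smu);
}
```

In this sketch, a caller on the resume path sees `true` whenever the cached state is `BACO_STATE_ENTER`, regardless of whether the SMU could answer at that moment.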
      drm/amdgpu: add drv_vram_usage_va for virt data exchange · 6d96ced7
      Tong Liu01 authored
      For vram_usagebyfirmware_v2_2, fw_vram_reserve is not used, so
      fw_vram_usage_va is NULL and virt data exchange can no longer be
      done. Add drv_vram_usage_va to do virt data exchange in the
      vram_usagebyfirmware_v2_2 case. Also fix some code style check
      issues from the previous patch that added the vram reservation logic.

      Signed-off-by: Tong Liu01 <Tong.Liu01@amd.com>
      Acked-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
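The fallback described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions: `struct mman_sketch` and `exchange_base()` are hypothetical names, and preferring the firmware mapping over the driver mapping when both exist is an assumption made for the example, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the memory-manager state described in the commit. */
struct mman_sketch {
    void *fw_vram_usage_va;   /* mapping of the firmware-reserved region, may be NULL */
    void *drv_vram_usage_va;  /* mapping of the driver-reserved region, may be NULL */
};

/* Pick the mapping to use for data exchange, or NULL if none is available. */
static void *exchange_base(const struct mman_sketch *m)
{
    assert(m != NULL);

    /* Assumed order: use the firmware mapping when present ... */
    if (m->fw_vram_usage_va)
        return m->fw_vram_usage_va;

    /* ... otherwise fall back to the driver mapping (the v2_2 case). */
    return m->drv_vram_usage_va;
}
```

A caller would skip the data exchange entirely when `exchange_base()` returns NULL.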
      drm/amdgpu: fix stall on CPU when allocate large system memory · c1420a5d
      James Zhu authored
      -v2: 1. rename variable to reduce confusion
           2. optimize the code
      -v3: move new define out of the middle of the code
      -v4: squash in minmax error fix (Luben)
      
      When applications try to allocate large system memory (more than 128GB),
      an RCU "stall on CPU" warning is reported.

      For such large system memory, walk_page_range usually takes more than 20s.
      The warning can be avoided by splitting the hmm range into smaller
      ones of no more than 64GB for each walk_page_range call.
      
      [  164.437617] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1753: amdgpu: 	create BO VA 0x7f63c7a00000 size 0x2f16000000 domain CPU
      [  164.488847] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1785: amdgpu: creating userptr BO for user_addr = 7f63c7a00000
      [  185.439116] rcu: INFO: rcu_sched self-detected stall on CPU
      [  185.439125] rcu: 	8-....: (20999 ticks this GP) idle=e22/1/0x4000000000000000 softirq=2242/2242 fqs=5249
      [  185.439137] 	(t=21000 jiffies g=6325 q=1215)
      [  185.439141] NMI backtrace for cpu 8
      [  185.439143] CPU: 8 PID: 3470 Comm: kfdtest Kdump: loaded Tainted: G           O      5.12.0-0_fbk5_zion_rc1_5697_g2c723fb88626 #1
      [  185.439147] Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant XL675d Gen10 Plus, BIOS A47 11/06/2020
      [  185.439150] Call Trace:
      [  185.439153]  <IRQ>
      [  185.439157]  dump_stack+0x64/0x7c
      [  185.439163]  nmi_cpu_backtrace.cold.7+0x30/0x65
      [  185.439165]  ? lapic_can_unplug_cpu+0x70/0x70
      [  185.439170]  nmi_trigger_cpumask_backtrace+0xf9/0x100
      [  185.439174]  rcu_dump_cpu_stacks+0xc5/0xf5
      [  185.439178]  rcu_sched_clock_irq.cold.97+0x112/0x38c
      [  185.439182]  ? tick_sched_handle.isra.21+0x50/0x50
      [  185.439185]  update_process_times+0x8c/0xc0
      [  185.439189]  tick_sched_timer+0x63/0x70
      [  185.439192]  __hrtimer_run_queues+0xff/0x250
      [  185.439195]  hrtimer_interrupt+0xf4/0x200
      [  185.439199]  __sysvec_apic_timer_interrupt+0x51/0xd0
      [  185.439201]  sysvec_apic_timer_interrupt+0x69/0x90
      [  185.439206]  </IRQ>
      [  185.439207]  asm_sysvec_apic_timer_interrupt+0x12/0x20
      [  185.439211] RIP: 0010:clear_page_rep+0x7/0x10
      [  185.439214] Code: e8 fe 7c 51 00 44 89 e2 48 89 ee 48 89 df e8 60 ff ff ff c6 03 00 5b 5d 41 5c c3 cc cc cc cc cc cc cc cc b9 00 02 00 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00
      [  185.439218] RSP: 0018:ffffc9000f58f818 EFLAGS: 00000246
      [  185.439220] RAX: 0000000000000000 RBX: 0000000000000881 RCX: 000000000000005c
      [  185.439223] RDX: 0000000000100dca RSI: 0000000000000000 RDI: ffff88a59e0e5d20
      [  185.439225] RBP: ffffea0096783940 R08: ffff888118c35280 R09: ffffea0096783940
      [  185.439227] R10: ffff888000000000 R11: 0000160000000000 R12: ffffea0096783980
      [  185.439228] R13: ffffea0096783940 R14: ffff88b07fdfdd00 R15: 0000000000000000
      [  185.439232]  prep_new_page+0x81/0xc0
      [  185.439236]  get_page_from_freelist+0x13be/0x16f0
      [  185.439240]  ? release_pages+0x16a/0x4a0
      [  185.439244]  __alloc_pages_nodemask+0x1ae/0x340
      [  185.439247]  alloc_pages_vma+0x74/0x1e0
      [  185.439251]  __handle_mm_fault+0xafe/0x1360
      [  185.439255]  handle_mm_fault+0xc3/0x280
      [  185.439257]  hmm_vma_fault.isra.22+0x49/0x90
      [  185.439261]  __walk_page_range+0x692/0x9b0
      [  185.439265]  walk_page_range+0x9b/0x120
      [  185.439269]  hmm_range_fault+0x4f/0x90
      [  185.439274]  amdgpu_hmm_range_get_pages+0x24f/0x260 [amdgpu]
      [  185.439463]  amdgpu_ttm_tt_get_user_pages+0xc2/0x190 [amdgpu]
      [  185.439603]  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x49f/0x7a0 [amdgpu]
      [  185.439774]  kfd_ioctl_alloc_memory_of_gpu+0xfb/0x410 [amdgpu]
      Signed-off-by: James Zhu <James.Zhu@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  2. 23 Nov, 2022 37 commits