===============
 GPU Debugging
===============

GPUVM Debugging
===============

To aid in debugging GPU virtual memory related problems, the driver supports a
number of module parameters:

`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.

`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
the GPU.


Decoding a GPUVM Page Fault
===========================

If you see a GPU page fault in the kernel log, you can decode it to figure
out what is going wrong in your application.  A page fault in your kernel
log may look something like this:

::

 [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
 VM_L2_PROTECTION_FAULT_STATUS:0x00301030
     Faulty UTCL2 client ID: TCP (0x8)
     MORE_FAULTS: 0x0
     WALKER_ERROR: 0x0
     PERMISSION_FAULTS: 0x3
     MAPPING_ERROR: 0x0
     RW: 0x0

First you have the memory hub: gfxhub or mmhub.
gfxhub is the memory hub used for graphics, compute, and, on some chips, SDMA.
mmhub is the memory hub used for multimedia and, on some chips, SDMA.

Next you have the vmid and pasid.  If the vmid is 0, this fault was likely
caused by the kernel driver or firmware.  If the vmid is non-0, it is generally
a fault in a user application.  The pasid is used to link a vmid to a system
process id.  If the process is active when the fault happens, the process
information will be printed.

The GPU virtual address that caused the fault comes next.

The client ID indicates the GPU block that caused the fault.
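
The header fields described so far (hub, vmid, pasid and process) can be
pulled out of the first line of the fault message with a short script.  The
following is a minimal sketch, not part of the driver or of any existing
tool: ``HEADER_RE`` and ``parse_fault_header`` are hypothetical names, and
the pattern is fitted only to the sample line shown above, so other kernel
versions may print the message differently.

```python
import re

# Header line copied from the sample fault in this document.
SAMPLE = (
    "[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, "
    "for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)"
)

# Hypothetical pattern, fitted to the sample line only (an assumption).
HEADER_RE = re.compile(
    r"\[(?P<hub>\w+)\].*?page fault \("
    r"src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)"
    r"(?:, for process (?P<process>\S+) pid (?P<pid>\d+))?"
)


def parse_fault_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("src_id", "ring", "vmid", "pasid"):
        info[key] = int(info[key])
    # Process information is only printed when the process is still active.
    if info["pid"] is not None:
        info["pid"] = int(info["pid"])
    # vmid 0 points at the kernel driver or firmware; a non-0 vmid is
    # generally a user application.
    if info["vmid"] == 0:
        info["likely_origin"] = "kernel driver or firmware"
    else:
        info["likely_origin"] = "user application"
    return info


if __name__ == "__main__":
    print(parse_fault_header(SAMPLE))
```

For the sample line this reports hub ``gfxhub0``, vmid 3 and process
``glxinfo``, which points at a user application rather than the kernel
driver or firmware.
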
Some common client IDs:

- CB/DB: The color/depth backend of the graphics pipe
- CPF: Command Processor Frontend
- CPC: Command Processor Compute
- CPG: Command Processor Graphics
- TCP/SQC/SQG: Shaders
- SDMA: SDMA engines
- VCN: Video encode/decode engines
- JPEG: JPEG engines

PERMISSION_FAULTS describes what faults were encountered:

- bit 0: the PTE was not valid
- bit 1: the PTE read bit was not set
- bit 2: the PTE write bit was not set
- bit 3: the PTE execute bit was not set

Finally, RW indicates whether the access was a read (0) or a write (1).

In the example above, a shader (client ID = TCP) generated a read (RW = 0x0) to
an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address
0x0000800102800000.  The user can then inspect their shader code and resource
descriptor state to determine what caused the GPU page fault.

UMR
===

`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
GPU debugging and diagnostics tool.
Please see the umr
`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
about its capabilities.
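
Independently of umr, the PERMISSION_FAULTS and RW fields described under
"Decoding a GPUVM Page Fault" can be decoded by hand or with a few lines of
script.  The following is a minimal sketch: the bit meanings are the ones
listed in this document, while the helper names (``parse_fields``,
``decode_permission_faults``, ``decode_rw``) and the ``NAME: 0xVALUE`` line
pattern are assumptions made for illustration and are not part of the driver
or of umr.

```python
import re

# Bit meanings as listed for PERMISSION_FAULTS in this document.
PERMISSION_FAULT_BITS = {
    0: "the PTE was not valid",
    1: "the PTE read bit was not set",
    2: "the PTE write bit was not set",
    3: "the PTE execute bit was not set",
}

# Hypothetical pattern for the "NAME: 0xVALUE" lines of the sample fault.
FIELD_RE = re.compile(r"^\s*(?P<name>[A-Z0-9_]+):\s*(?P<value>0x[0-9a-fA-F]+)\s*$")


def parse_fields(log_text):
    """Collect every 'NAME: 0xVALUE' line into a dict of name to integer."""
    fields = {}
    for line in log_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("name")] = int(match.group("value"), 16)
    return fields


def decode_permission_faults(value):
    """Return the description of each PERMISSION_FAULTS bit that is set."""
    return [
        text
        for bit, text in PERMISSION_FAULT_BITS.items()
        if value & (1 << bit)
    ]


def decode_rw(value):
    """RW is 0 for a read and 1 for a write."""
    return "write" if value & 1 else "read"


if __name__ == "__main__":
    sample = """
    VM_L2_PROTECTION_FAULT_STATUS:0x00301030
        Faulty UTCL2 client ID: TCP (0x8)
        MORE_FAULTS: 0x0
        WALKER_ERROR: 0x0
        PERMISSION_FAULTS: 0x3
        MAPPING_ERROR: 0x0
        RW: 0x0
    """
    fields = parse_fields(sample)
    print(decode_rw(fields["RW"]), decode_permission_faults(fields["PERMISSION_FAULTS"]))
```

For the sample fault, PERMISSION_FAULTS = 0x3 has bits 0 and 1 set and RW is
0, which matches the reading given earlier: a read that hit an invalid page.
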