.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
The PCI Express Advanced Error Reporting Drive
==============================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.
          - Yanmin Zhang <yanmin.zhang@intel.co

:Copyright: |copy| 2006 Intel Corporation

Overview
===========

About this guide
----------------

This guide describes the basics of the PCI Exp
Reporting (AER) driver and provides informatio
well as how to enable the drivers of Endpoint
the PCIe AER driver.


What is the PCIe AER Driver?
----------------------------

PCIe error signaling can occur on the PCIe lin
or on behalf of transactions initiated on the
defines two error reporting paradigms: the bas
the Advanced Error Reporting capability. The b
required of all PCIe components providing a mi
set of error reporting requirements. Advanced
capability is implemented with a PCIe Advanced
extended capability structure providing more r

The PCIe AER driver provides the infrastructur
Error Reporting capability. The PCIe AER drive
functions:

  - Gathers the comprehensive error informatio
  - Reports errors to the users.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports and
AER capability.


User Guide
==========

Include the PCIe AER Root Driver into the Linu
----------------------------------------------

The PCIe AER driver is a Root Port service dri
via the PCIe Port Bus driver. If a user wants
must be compiled. It is enabled with CONFIG_PC
depends on CONFIG_PCIEPORTBUS.

Load PCIe AER Root Driver
-------------------------

Some systems have AER support in firmware. Ena
the same time the firmware handles AER would r
behavior.
Therefore, Linux does not handle AER
grants AER control to the OS via the ACPI _OSC
Specification for details regarding _OSC usage

AER error output
----------------

When a PCIe AER error is captured, an error me
console. If it's a correctable error, it is ou
Otherwise, it is printed as an error. So users
log level to filter out correctable error mess

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncor
  0000:50:00.0: device [8086:0329] error sta
  0000:50:00.0: [20] Unsupported Request
  0000:50:00.0: TLP Header: 04000001 00200a0

In the example, 'Requester ID' means the ID of
the error message to the Root Port. Please ref
fields.

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counter
in the form of sysfs attributes which are docu
Documentation/ABI/testing/sysfs-bus-pci-device

Developer Guide
===============

To enable error recovery, a software driver mu

To support AER better, developers need to unde

PCIe errors are classified into two types: cor
and uncorrectable errors. This classification
of those errors, which may result in degraded
failure.

Correctable errors have no impact on the func
interface. The PCIe protocol can recover witho
intervention or any loss of data. These errors
corrected by hardware.

Unlike correctable errors, uncorrectable
errors impact functionality of the interface.
can cause a particular transaction or a partic
to be unreliable. Depending on those error con
errors are further classified into non-fatal e
Non-fatal errors cause the particular transact
but the PCIe link itself is fully functional.
the other hand, cause the link to be unreliabl

When PCIe error reporting is enabled, a device
error message to the Root Port above it when i
an error.
The Root Port, upon receiving an err
internally processes and logs the error messag
Capability structure. Error information being
the error reporting agent's requester ID into
Identification Registers and setting the error
Status Register accordingly. If AER error repo
Error Command Register, the Root Port generate
error is detected.

Note that the errors as described above are re
hierarchy and links. These errors do not inclu
errors because device specific errors will sti
the device driver.

Provide callbacks
-----------------

callback reset_link to reset PCIe link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCIe physic
fatal error happens. The Root Port AER service
default reset_link function, but different Ups
have different specifications to reset the PCI
Upstream Port drivers may provide their own re

Section 3.2.2.2 provides more detailed info on
reset_link.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCIe AER Root driver uses error callbacks
with downstream device drivers associated with
when performing error recovery actions.

Data struct pci_driver has a pointer, err_hand
pci_error_handlers that consists of a couple of
pointers. The AER driver follows the rules def
pci-error-recovery.rst except PCIe-specific pa
reset_link). Please refer to pci-error-recover
definitions of the callbacks.

The sections below specify when to call the er

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the func
the interface. The PCIe protocol can recover w
software intervention or any loss of data. The
require any recovery actions.
The AER driver c
correctable error status register accordingly

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal erro
at upstream is not required. The AER driver ca
pci_channel_io_normal) to all drivers associat
question. For example::

  Endpoint <==> Downstream Port B <==> Upstrea

If Upstream Port A captures an AER error, the
Downstream Port B and Endpoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_N
whether it can recover or the AER driver calls

If an error message indicates a fatal error, k
error_detected(dev, pci_channel_io_frozen) to
a hierarchy in question. Then, performing link
necessary. As different kinds of devices might
to reset link, AER port service driver is requ
function to reset link via callback parameter
function. If reset_link is not NULL, recovery
to reset the link. If error_detected returns P
and reset_link returns PCI_ERS_RESULT_RECOVERE
to mmio_enabled.

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCIe device driver does no
  error recovery handler (pci_driver->err_hand

A:
  The devices attached with the driver won't b
  error is fatal, kernel will print out warnin
  to section 3 for more information.

Q:
  What happens if an upstream port service dri
  callback reset_link?

A:
  Fatal error recovery will fail if the errors
  upstream ports that are attached by the servi


Software error injection
========================

Debugging PCIe AER error recovery code is quit
is hard to trigger real hardware errors.
Softw
injection can be used to fake various kinds of

First you should enable PCIe AER software erro
configuration, that is, the following item should

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJE

After rebooting with the new kernel or inserting the mod
/dev/aer_inject should be created.

Then, you need a user space tool named aer-inj
from:

    https://github.com/intel/aer-inject.git

More information about aer-inject can be found
its source code.