.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.


What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability.
The baseline capability is
required of all PCI Express components providing a minimum defined
set of error reporting requirements. Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure providing more robust error reporting.

The PCI Express AER driver provides the infrastructure to support PCI
Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to the users.
  - Performs error recovery actions.

The AER driver only attaches to Root Ports which support the PCI Express
AER capability.


User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. If a user wants to use it, the driver
must be compiled. It is enabled with the CONFIG_PCIEAER option. It
depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. If it is a correctable error, it is output as a warning.
Otherwise, it is printed as an error. Users can therefore choose a
log level that filters out correctable error messages.
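
The log-level filtering described above can be modelled in a few lines of user-space C. This is only a sketch, not kernel code: the enum and function names are hypothetical, and it assumes the usual printk convention that KERN_ERR is level 3, KERN_WARNING is level 4, and a message reaches the console only when its level is numerically lower than the console log level.

```c
#include <stdbool.h>

/* Hypothetical severity classes, mirroring the three AER severities. */
enum aer_severity {
    AER_SEV_CORRECTABLE,
    AER_SEV_NONFATAL,
    AER_SEV_FATAL,
};

/* printk log levels assumed by this sketch (KERN_ERR = 3, KERN_WARNING = 4). */
#define SKETCH_LOGLEVEL_ERR     3
#define SKETCH_LOGLEVEL_WARNING 4

/* Correctable errors are logged as warnings, everything else as errors. */
int aer_message_level(enum aer_severity sev)
{
    return sev == AER_SEV_CORRECTABLE ? SKETCH_LOGLEVEL_WARNING
                                      : SKETCH_LOGLEVEL_ERR;
}

/* A message is shown on the console if its level is below console_loglevel. */
bool aer_message_on_console(enum aer_severity sev, int console_loglevel)
{
    return aer_message_level(sev) < console_loglevel;
}
```

With a console log level of 4, for instance, `aer_message_on_console(AER_SEV_CORRECTABLE, 4)` is false while uncorrectable errors still appear.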

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to the PCI Express
specifications for the other fields.

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats

Developer Guide
===============

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER better, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors.
This classification is based on the impacts
of those errors, which may result in degraded performance or function
failure.

Correctable errors pose no impacts on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device will automatically send an
error message to the PCIe root port above it when the device captures
an error.
The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Error information being logged includes storing
the error reporting agent's requestor ID into the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt if an
error is detected.

Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.

Configure the AER capability structure
--------------------------------------

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They can also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
the helper functions section.

Provide callbacks
-----------------

callback reset_link to reset PCI Express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications to reset the PCI Express link, so all
upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added::

        pci_ers_result_t (*reset_link) (struct pci_dev *dev);

The section on non-correctable errors provides more detailed info on
when to call reset_link.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question
when performing error recovery actions.

Data struct pci_driver has a pointer, err_handler, to point to
pci_error_handlers, which consists of a couple of callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.rst except PCI Express specific parts (e.g.
reset_link).
Please refer to pci-error-recovery.rst for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors pose no impacts on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing link reset
at upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) to all drivers associated within a hierarchy in
question. For example::

  Endpoint <==> Downstream Port B <==> Upstream Port A <==> Root Port

If Upstream Port A captures an AER error, the hierarchy consists of
Downstream Port B and Endpoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. When the driver can recover, the AER driver
calls mmio_enabled next.

If an error message indicates a fatal error, the kernel will broadcast
error_detected(dev, pci_channel_io_frozen) to all drivers within
a hierarchy in question. Then, performing link reset at upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
function to reset the link. First, the kernel checks whether the upstream
component has an AER driver. If it has, the kernel uses the reset_link
callback of the AER driver. If the upstream component has no AER driver
and the port is a downstream port, a hot reset is performed as the
default by setting the Secondary Bus Reset bit of the Bridge Control
register associated with the downstream port. As for upstream ports,
they should provide their own AER service drivers with a reset_link
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled.
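
The decision points described in this section can be summarised as a small user-space model. This is a sketch, not kernel code: the enums are simplified stand-ins for pci_channel_state_t and pci_ers_result_t, the function names are hypothetical, and it assumes that a non-fatal error proceeds to mmio_enabled whenever error_detected returns PCI_ERS_RESULT_CAN_RECOVER.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's pci_channel_state_t values. */
enum channel_state {
    CHANNEL_IO_NORMAL,   /* models pci_channel_io_normal */
    CHANNEL_IO_FROZEN,   /* models pci_channel_io_frozen */
};

/* Simplified stand-ins for the pci_ers_result_t values named above. */
enum ers_result {
    ERS_CAN_RECOVER,
    ERS_NEED_RESET,
    ERS_DISCONNECT,
    ERS_RECOVERED,
};

/* Channel state passed to error_detected() for a given severity. */
enum channel_state state_for_error(bool fatal)
{
    return fatal ? CHANNEL_IO_FROZEN : CHANNEL_IO_NORMAL;
}

/* Only fatal errors require a link reset at the upstream component. */
bool link_reset_required(bool fatal)
{
    return fatal;
}

/*
 * Returns true when recovery proceeds to mmio_enabled.
 * 'reset' is ignored for non-fatal errors, where no link reset is done.
 */
bool proceeds_to_mmio_enabled(bool fatal, enum ers_result detected,
                              enum ers_result reset)
{
    if (detected != ERS_CAN_RECOVER)
        return false;
    if (link_reset_required(fatal))
        return reset == ERS_RECOVERED;
    return true;
}
```

For a fatal error, `proceeds_to_mmio_enabled(true, ERS_CAN_RECOVER, ERS_RECOVERED)` is true, while any other reset_link result stops the flow before mmio_enabled.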

helper functions
----------------
::

  int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the Root Port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.

::

  int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the Root Port when an error is detected.

::

  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);

pci_cleanup_aer_uncorrect_error_status clears the uncorrectable
error status register.

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print out warning messages. Please refer
  to the rest of the Developer Guide for more information.

Q:
  What happens if an upstream port service driver does not provide
  callback reset_link?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports that the service driver is attached to.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the Root
  Port.

Q:
  What modifications will that driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in devices and
  clean up the uncorrectable status register. Please refer to the helper
  functions section.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIe errors.

First you should enable PCIe AER software error injection in the kernel
configuration, that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device file
named /dev/aer_inject should be created.
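
Before running an injection tool, it can be useful to confirm that the device file is present. The helper below is a minimal user-space sketch; the function name is hypothetical, and it assumes /dev/aer_inject appears as a character device node.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Returns true if 'path' exists and is a character device.
 * Any stat() failure (missing file, no permission) is treated as "absent".
 */
bool is_char_device(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;
    return S_ISCHR(st.st_mode);
}
```

Calling `is_char_device("/dev/aer_inject")` after loading the module gives a quick indication of whether the injection interface is available.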

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://github.com/intel/aer-inject.git

More information about aer-inject can be found in the document that comes
with its source code.