
TOMOYO Linux Cross Reference
Linux/Documentation/PCI/pcieaer-howto.rst


Diff markup

Differences between /Documentation/PCI/pcieaer-howto.rst (Version linux-6.11.5) and /Documentation/PCI/pcieaer-howto.rst (Version linux-6.2.16)


  1 .. SPDX-License-Identifier: GPL-2.0                 1 .. SPDX-License-Identifier: GPL-2.0
  2 .. include:: <isonum.txt>                           2 .. include:: <isonum.txt>
  3                                                     3 
  4 ==============================================      4 ===========================================================
  5 The PCI Express Advanced Error Reporting Drive      5 The PCI Express Advanced Error Reporting Driver Guide HOWTO
  6 ==============================================      6 ===========================================================
  7                                                     7 
  8 :Authors: - T. Long Nguyen <tom.l.nguyen@intel.      8 :Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
  9           - Yanmin Zhang <yanmin.zhang@intel.co      9           - Yanmin Zhang <yanmin.zhang@intel.com>
 10                                                    10 
 11 :Copyright: |copy| 2006 Intel Corporation          11 :Copyright: |copy| 2006 Intel Corporation
 12                                                    12 
 13 Overview                                           13 Overview
 14 ===========                                        14 ===========
 15                                                    15 
 16 About this guide                                   16 About this guide
 17 ----------------                                   17 ----------------
 18                                                    18 
 19 This guide describes the basics of the PCI Exp !!  19 This guide describes the basics of the PCI Express Advanced Error
 20 Reporting (AER) driver and provides informatio     20 Reporting (AER) driver and provides information on how to use it, as
 21 well as how to enable the drivers of Endpoint  !!  21 well as how to enable the drivers of endpoint devices to conform with
 22 the PCIe AER driver.                           !!  22 PCI Express AER driver.
 23                                                    23 
 24                                                    24 
 25 What is the PCIe AER Driver?                   !!  25 What is the PCI Express AER Driver?
 26 ----------------------------                   !!  26 -----------------------------------
 27                                                    27 
 28 PCIe error signaling can occur on the PCIe lin !!  28 PCI Express error signaling can occur on the PCI Express link itself
 29 or on behalf of transactions initiated on the  !!  29 or on behalf of transactions initiated on the link. PCI Express
 30 defines two error reporting paradigms: the bas     30 defines two error reporting paradigms: the baseline capability and
 31 the Advanced Error Reporting capability. The b     31 the Advanced Error Reporting capability. The baseline capability is
 32 required of all PCIe components providing a mi !!  32 required of all PCI Express components providing a minimum defined
 33 set of error reporting requirements. Advanced      33 set of error reporting requirements. Advanced Error Reporting
 34 capability is implemented with a PCIe Advanced !!  34 capability is implemented with a PCI Express advanced error reporting
 35 extended capability structure providing more r     35 extended capability structure providing more robust error reporting.
 36                                                    36 
 37 The PCIe AER driver provides the infrastructur !!  37 The PCI Express AER driver provides the infrastructure to support PCI
 38 Error Reporting capability. The PCIe AER drive !!  38 Express Advanced Error Reporting capability. The PCI Express AER
 39 functions:                                     !!  39 driver provides three basic functions:
 40                                                    40 
 41   - Gathers the comprehensive error informatio     41   - Gathers the comprehensive error information if errors occurred.
 42   - Reports error to the users.                    42   - Reports error to the users.
 43   - Performs error recovery actions.               43   - Performs error recovery actions.
 44                                                    44 
 45 The AER driver only attaches to Root Ports and !!  45 AER driver only attaches root ports which support PCI-Express AER
 46 AER capability.                                !!  46 capability.
 47                                                    47 
 48                                                    48 
 49 User Guide                                         49 User Guide
 50 ==========                                         50 ==========
 51                                                    51 
 52 Include the PCIe AER Root Driver into the Linu !!  52 Include the PCI Express AER Root Driver into the Linux Kernel
 53 ---------------------------------------------- !!  53 -------------------------------------------------------------
 54                                                    54 
 55 The PCIe AER driver is a Root Port service dri !!  55 The PCI Express AER Root driver is a Root Port service driver attached
 56 via the PCIe Port Bus driver. If a user wants  !!  56 to the PCI Express Port Bus driver. If a user wants to use it, the driver
 57 must be compiled. It is enabled with CONFIG_PC !!  57 has to be compiled. Option CONFIG_PCIEAER supports this capability. It
 58 depends on CONFIG_PCIEPORTBUS.                 !!  58 depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and
                                                   >>  59 CONFIG_PCIEAER = y.
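Reviewer note: the build requirement stated in both versions (CONFIG_PCIEAER, which depends on CONFIG_PCIEPORTBUS) can be checked against a kernel `.config` with a few lines of script. The following is a minimal sketch, not part of either file version; the function name `missing_options` and the sample text are hypothetical, and it assumes both options are set to `y` as the older text asks.

```python
from typing import Iterable, List

def missing_options(config_text: str,
                    required: Iterable[str] = ("CONFIG_PCIEPORTBUS", "CONFIG_PCIEAER")) -> List[str]:
    """Return the required options that are not set to 'y' in a .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and non-assignments.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]

# Hypothetical .config fragment, for illustration only.
sample = "CONFIG_PCIEPORTBUS=y\n# CONFIG_PCIEAER is not set\n"
print(missing_options(sample))
```

To check a real build, pass the contents of the `.config` file in the kernel build directory instead of `sample`.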
 59                                                    60 
 60 Load PCIe AER Root Driver                      !!  61 Load PCI Express AER Root Driver
 61 -------------------------                      !!  62 --------------------------------
 62                                                    63 
 63 Some systems have AER support in firmware. Ena     64 Some systems have AER support in firmware. Enabling Linux AER support at
 64 the same time the firmware handles AER would r !!  65 the same time the firmware handles AER may result in unpredictable
 65 behavior. Therefore, Linux does not handle AER     66 behavior. Therefore, Linux does not handle AER events unless the firmware
 66 grants AER control to the OS via the ACPI _OSC !!  67 grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
 67 Specification for details regarding _OSC usage     68 Specification for details regarding _OSC usage.
 68                                                    69 
 69 AER error output                                   70 AER error output
 70 ----------------                                   71 ----------------
 71                                                    72 
 72 When a PCIe AER error is captured, an error me     73 When a PCIe AER error is captured, an error message will be output to
 73 console. If it's a correctable error, it is ou !!  74 console. If it's a correctable error, it is output as a warning.
 74 Otherwise, it is printed as an error. So users     75 Otherwise, it is printed as an error. So users could choose different
 75 log level to filter out correctable error mess     76 log level to filter out correctable error messages.
 76                                                    77 
 77 Below shows an example::                           78 Below shows an example::
 78                                                    79 
 79   0000:50:00.0: PCIe Bus Error: severity=Uncor     80   0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
 80   0000:50:00.0:   device [8086:0329] error sta     81   0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
 81   0000:50:00.0:    [20] Unsupported Request        82   0000:50:00.0:    [20] Unsupported Request    (First)
 82   0000:50:00.0:   TLP Header: 04000001 00200a0     83   0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100
 83                                                    84 
 84 In the example, 'Requester ID' means the ID of !!  85 In the example, 'Requester ID' means the ID of the device who sends
 85 the error message to the Root Port. Please ref !!  86 the error message to root port. Pls. refer to pci express specs for
 86 fields.                                        !!  87 other fields.
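Reviewer note: two fields of the example log can be decoded by hand, which may help when reading such messages. The sketch below is not part of either file version. It assumes the conventional (non-ARI) Requester ID layout of bus in bits 15:8, device in bits 7:3, and function in bits 2:0; the helper names `decode_requester_id` and `set_bits` are hypothetical.

```python
from typing import List, Tuple

def decode_requester_id(rid: int) -> Tuple[int, int, int]:
    """Split a 16-bit Requester ID into (bus, device, function)."""
    bus = (rid >> 8) & 0xFF
    device = (rid >> 3) & 0x1F
    function = rid & 0x07
    return bus, device, function

def set_bits(status: int) -> List[int]:
    """List the bit positions that are set in a 32-bit status word."""
    return [bit for bit in range(32) if status & (1 << bit)]

# Values taken from the example log: id=0500 and error status 00100000.
bus, dev, fn = decode_requester_id(0x0500)
print(f"{bus:02x}:{dev:02x}.{fn}")   # prints 05:00.0
print(set_bits(0x00100000))          # prints [20]
```

The single set bit, 20, agrees with the `[20] Unsupported Request` line in the example.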
 87                                                    88 
 88 AER Statistics / Counters                          89 AER Statistics / Counters
 89 -------------------------                          90 -------------------------
 90                                                    91 
 91 When PCIe AER errors are captured, the counter     92 When PCIe AER errors are captured, the counters / statistics are also exposed
 92 in the form of sysfs attributes which are docu     93 in the form of sysfs attributes which are documented at
 93 Documentation/ABI/testing/sysfs-bus-pci-device     94 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
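Reviewer note: the counters mentioned here can be read from user space. The sketch below is not part of either file version. It assumes each attribute holds one "name count" pair per line, with the count as the last whitespace-separated token, and that the per-device attribute is named `aer_dev_correctable` (with `aer_dev_fatal` and `aer_dev_nonfatal` alongside it); check the ABI document for the authoritative names and format. The function names and the sample text are hypothetical.

```python
from pathlib import Path
from typing import Dict

def parse_aer_stats(text: str) -> Dict[str, int]:
    """Parse 'name count' lines; the name itself may contain spaces."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, count = line.rpartition(" ")
        stats[name.strip()] = int(count)
    return stats

def read_aer_stats(bdf: str, attr: str = "aer_dev_correctable") -> Dict[str, int]:
    """Read one AER statistics attribute for a device such as '0000:50:00.0'."""
    path = Path("/sys/bus/pci/devices") / bdf / attr
    return parse_aer_stats(path.read_text())

# Illustrative input only, not captured from a real system.
sample = "Receiver Error 2\nBad TLP 0\nTOTAL_ERR_COR 2\n"
print(parse_aer_stats(sample))
```

On a machine with AER support, `read_aer_stats("0000:50:00.0")` would read the live counters for that device, where the address is a placeholder taken from the log example.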
 94                                                    95 
 95 Developer Guide                                    96 Developer Guide
 96 ===============                                    97 ===============
 97                                                    98 
 98 To enable error recovery, a software driver mu !!  99 To enable AER aware support requires a software driver to configure
                                                   >> 100 the AER capability structure within its device and to provide callbacks.
 99                                                   101 
100 To support AER better, developers need to unde !! 102 To support AER better, developers need understand how AER does work
                                                   >> 103 firstly.
101                                                   104 
102 PCIe errors are classified into two types: cor !! 105 PCI Express errors are classified into two types: correctable errors
103 and uncorrectable errors. This classification  !! 106 and uncorrectable errors. This classification is based on the impacts
104 of those errors, which may result in degraded     107 of those errors, which may result in degraded performance or function
105 failure.                                          108 failure.
106                                                   109 
107 Correctable errors pose no impacts on the func    110 Correctable errors pose no impacts on the functionality of the
108 interface. The PCIe protocol can recover witho !! 111 interface. The PCI Express protocol can recover without any software
109 intervention or any loss of data. These errors    112 intervention or any loss of data. These errors are detected and
110 corrected by hardware.                         !! 113 corrected by hardware. Unlike correctable errors, uncorrectable
111                                                << 
112 Unlike correctable errors, uncorrectable       << 
113 errors impact functionality of the interface.     114 errors impact functionality of the interface. Uncorrectable errors
114 can cause a particular transaction or a partic !! 115 can cause a particular transaction or a particular PCI Express link
115 to be unreliable. Depending on those error con    116 to be unreliable. Depending on those error conditions, uncorrectable
116 errors are further classified into non-fatal e    117 errors are further classified into non-fatal errors and fatal errors.
117 Non-fatal errors cause the particular transact    118 Non-fatal errors cause the particular transaction to be unreliable,
118 but the PCIe link itself is fully functional.  !! 119 but the PCI Express link itself is fully functional. Fatal errors, on
119 the other hand, cause the link to be unreliabl    120 the other hand, cause the link to be unreliable.
120                                                   121 
121 When PCIe error reporting is enabled, a device !! 122 When AER is enabled, a PCI Express device will automatically send an
122 error message to the Root Port above it when i !! 123 error message to the PCIe root port above it when the device captures
123 an error. The Root Port, upon receiving an err    124 an error. The Root Port, upon receiving an error reporting message,
124 internally processes and logs the error messag !! 125 internally processes and logs the error message in its PCI Express
125 Capability structure. Error information being  !! 126 capability structure. Error information being logged includes storing
126 the error reporting agent's requestor ID into     127 the error reporting agent's requestor ID into the Error Source
127 Identification Registers and setting the error    128 Identification Registers and setting the error bits of the Root Error
128 Status Register accordingly. If AER error repo !! 129 Status Register accordingly. If AER error reporting is enabled in Root
129 Error Command Register, the Root Port generate !! 130 Error Command Register, the Root Port generates an interrupt if an
130 error is detected.                                131 error is detected.
131                                                   132 
132 Note that the errors as described above are re !! 133 Note that the errors as described above are related to the PCI Express
133 hierarchy and links. These errors do not inclu    134 hierarchy and links. These errors do not include any device specific
134 errors because device specific errors will sti    135 errors because device specific errors will still get sent directly to
135 the device driver.                                136 the device driver.
136                                                   137 
                                                   >> 138 Configure the AER capability structure
                                                   >> 139 --------------------------------------
                                                   >> 140 
                                                   >> 141 AER aware drivers of PCI Express component need change the device
                                                   >> 142 control registers to enable AER. They also could change AER registers,
                                                   >> 143 including mask and severity registers. Helper function
                                                   >> 144 pci_enable_pcie_error_reporting could be used to enable AER. See
                                                   >> 145 section 3.3.
                                                   >> 146 
137 Provide callbacks                                 147 Provide callbacks
138 -----------------                                 148 -----------------
139                                                   149 
140 callback reset_link to reset PCIe link         !! 150 callback reset_link to reset pci express link
141 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~         !! 151 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
142                                                   152 
143 This callback is used to reset the PCIe physic !! 153 This callback is used to reset the pci express physical link when a
144 fatal error happens. The Root Port AER service !! 154 fatal error happens. The root port aer service driver provides a
145 default reset_link function, but different Ups !! 155 default reset_link function, but different upstream ports might
146 have different specifications to reset the PCI !! 156 have different specifications to reset pci express link, so all
147 Upstream Port drivers may provide their own re !! 157 upstream ports should provide their own reset_link functions.
148                                                   158 
149 Section 3.2.2.2 provides more detailed info on    159 Section 3.2.2.2 provides more detailed info on when to call
150 reset_link.                                       160 reset_link.
151                                                   161 
152 PCI error-recovery callbacks                      162 PCI error-recovery callbacks
153 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~                      163 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
154                                                   164 
155 The PCIe AER Root driver uses error callbacks  !! 165 The PCI Express AER Root driver uses error callbacks to coordinate
156 with downstream device drivers associated with    166 with downstream device drivers associated with a hierarchy in question
157 when performing error recovery actions.           167 when performing error recovery actions.
158                                                   168 
159 Data struct pci_driver has a pointer, err_hand    169 Data struct pci_driver has a pointer, err_handler, to point to
160 pci_error_handlers who consists of a couple of    170 pci_error_handlers who consists of a couple of callback function
161 pointers. The AER driver follows the rules def !! 171 pointers. AER driver follows the rules defined in
162 pci-error-recovery.rst except PCIe-specific pa !! 172 pci-error-recovery.txt except pci express specific parts (e.g.
163 reset_link). Please refer to pci-error-recover !! 173 reset_link). Pls. refer to pci-error-recovery.txt for detailed
164 definitions of the callbacks.                     174 definitions of the callbacks.
165                                                   175 
166 The sections below specify when to call the er !! 176 Below sections specify when to call the error callback functions.
167                                                   177 
168 Correctable errors                                178 Correctable errors
169 ~~~~~~~~~~~~~~~~~~                                179 ~~~~~~~~~~~~~~~~~~
170                                                   180 
171 Correctable errors pose no impacts on the func    181 Correctable errors pose no impacts on the functionality of
172 the interface. The PCIe protocol can recover w !! 182 the interface. The PCI Express protocol can recover without any
173 software intervention or any loss of data. The    183 software intervention or any loss of data. These errors do not
174 require any recovery actions. The AER driver c    184 require any recovery actions. The AER driver clears the device's
175 correctable error status register accordingly     185 correctable error status register accordingly and logs these errors.
176                                                   186 
177 Non-correctable (non-fatal and fatal) errors      187 Non-correctable (non-fatal and fatal) errors
178 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~      188 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
179                                                   189 
180 If an error message indicates a non-fatal erro    190 If an error message indicates a non-fatal error, performing link reset
181 at upstream is not required. The AER driver ca    191 at upstream is not required. The AER driver calls error_detected(dev,
182 pci_channel_io_normal) to all drivers associat    192 pci_channel_io_normal) to all drivers associated within a hierarchy in
183 question. For example::                        !! 193 question. for example::
184                                                   194 
185   Endpoint <==> Downstream Port B <==> Upstrea !! 195   EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort
186                                                   196 
187 If Upstream Port A captures an AER error, the  !! 197 If Upstream port A captures an AER error, the hierarchy consists of
188 Downstream Port B and Endpoint.                !! 198 Downstream port B and EndPoint.
189                                                   199 
190 A driver may return PCI_ERS_RESULT_CAN_RECOVER    200 A driver may return PCI_ERS_RESULT_CAN_RECOVER,
191 PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_N    201 PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
192 whether it can recover or the AER driver calls    202 whether it can recover or the AER driver calls mmio_enabled as next.
193                                                   203 
194 If an error message indicates a fatal error, k    204 If an error message indicates a fatal error, kernel will broadcast
195 error_detected(dev, pci_channel_io_frozen) to     205 error_detected(dev, pci_channel_io_frozen) to all drivers within
196 a hierarchy in question. Then, performing link    206 a hierarchy in question. Then, performing link reset at upstream is
197 necessary. As different kinds of devices might    207 necessary. As different kinds of devices might use different approaches
198 to reset link, AER port service driver is requ    208 to reset link, AER port service driver is required to provide the
199 function to reset link via callback parameter     209 function to reset link via callback parameter of pcie_do_recovery()
200 function. If reset_link is not NULL, recovery     210 function. If reset_link is not NULL, recovery function will use it
201 to reset the link. If error_detected returns P    211 to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
202 and reset_link returns PCI_ERS_RESULT_RECOVERE    212 and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
203 to mmio_enabled.                                  213 to mmio_enabled.
204                                                   214 
                                                   >> 215 helper functions
                                                   >> 216 ----------------
                                                   >> 217 ::
                                                   >> 218 
                                                   >> 219   int pci_enable_pcie_error_reporting(struct pci_dev *dev);
                                                   >> 220 
                                                   >> 221 pci_enable_pcie_error_reporting enables the device to send error
                                                   >> 222 messages to root port when an error is detected. Note that devices
                                                   >> 223 don't enable the error reporting by default, so device drivers need
                                                   >> 224 call this function to enable it.
                                                   >> 225 
                                                   >> 226 ::
                                                   >> 227 
                                                   >> 228   int pci_disable_pcie_error_reporting(struct pci_dev *dev);
                                                   >> 229 
                                                   >> 230 pci_disable_pcie_error_reporting disables the device to send error
                                                   >> 231 messages to root port when an error is detected.
                                                   >> 232 
                                                   >> 233 ::
                                                   >> 234 
                                                   >> 235   int pci_aer_clear_nonfatal_status(struct pci_dev *dev);`
                                                   >> 236 
                                                   >> 237 pci_aer_clear_nonfatal_status clears non-fatal errors in the uncorrectable
                                                   >> 238 error status register.
                                                   >> 239 
205 Frequent Asked Questions                          240 Frequent Asked Questions
206 ------------------------                          241 ------------------------
207                                                   242 
208 Q:                                                243 Q:
209   What happens if a PCIe device driver does no !! 244   What happens if a PCI Express device driver does not provide an
210   error recovery handler (pci_driver->err_hand    245   error recovery handler (pci_driver->err_handler is equal to NULL)?
211                                                   246 
212 A:                                                247 A:
213   The devices attached with the driver won't b    248   The devices attached with the driver won't be recovered. If the
214   error is fatal, kernel will print out warnin    249   error is fatal, kernel will print out warning messages. Please refer
215   to section 3 for more information.              250   to section 3 for more information.
216                                                   251 
217 Q:                                                252 Q:
218   What happens if an upstream port service dri    253   What happens if an upstream port service driver does not provide
219   callback reset_link?                            254   callback reset_link?
220                                                   255 
221 A:                                                256 A:
222   Fatal error recovery will fail if the errors    257   Fatal error recovery will fail if the errors are reported by the
223   upstream ports who are attached by the servi    258   upstream ports who are attached by the service driver.
224                                                   259 
                                                   >> 260 Q:
                                                   >> 261   How does this infrastructure deal with driver that is not PCI
                                                   >> 262   Express aware?
                                                   >> 263 
                                                   >> 264 A:
                                                   >> 265   This infrastructure calls the error callback functions of the
                                                   >> 266   driver when an error happens. But if the driver is not aware of
                                                   >> 267   PCI Express, the device might not report its own errors to root
                                                   >> 268   port.
                                                   >> 269 
                                                   >> 270 Q:
                                                   >> 271   What modifications will that driver need to make it compatible
                                                   >> 272   with the PCI Express AER Root driver?
                                                   >> 273 
                                                   >> 274 A:
                                                   >> 275   It could call the helper functions to enable AER in devices and
                                                   >> 276   cleanup uncorrectable status register. Pls. refer to section 3.3.
                                                   >> 277 
225                                                   278 
226 Software error injection                          279 Software error injection
227 ========================                          280 ========================
228                                                   281 
229 Debugging PCIe AER error recovery code is quit    282 Debugging PCIe AER error recovery code is quite difficult because it
230 is hard to trigger real hardware errors. Softw    283 is hard to trigger real hardware errors. Software based error
231 injection can be used to fake various kinds of    284 injection can be used to fake various kinds of PCIe errors.
232                                                   285 
233 First you should enable PCIe AER software erro    286 First you should enable PCIe AER software error injection in kernel
234 configuration, that is, following item should     287 configuration, that is, following item should be in your .config.
235                                                   288 
236 CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJE    289 CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
237                                                   290 
238 After reboot with new kernel or insert the mod    291 After reboot with new kernel or insert the module, a device file named
239 /dev/aer_inject should be created.                292 /dev/aer_inject should be created.
240                                                   293 
241 Then, you need a user space tool named aer-inj    294 Then, you need a user space tool named aer-inject, which can be gotten
242 from:                                             295 from:
243                                                   296 
244     https://github.com/intel/aer-inject.git    !! 297     https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
245                                                   298 
246 More information about aer-inject can be found !! 299 More information about aer-inject can be found in the document comes
247 its source code.                               !! 300 with its source code.
                                                      
