==========================
PCI Bus EEH Error Recovery
==========================

Linas Vepstas <linas@austin.ibm.com>

12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries comput
controller chips that have extended capabiliti
reporting a large variety of PCI bus error con
go under the name of "EEH", for "Enhanced Erro
hardware features allow PCI bus errors to be c
card to be "rebooted", without also having to
system.

This is in contrast to traditional PCI error h
PCI chip is wired directly to the CPU, and an
a CPU machine-check/check-stop condition, halt
Another "traditional" technique is to ignore s
can lead to data corruption, both of user data
hung/unresponsive adapters, or system crashes/
the idea behind EEH is that the operating syst
reliable and robust by protecting it from PCI
the OS the ability to "reboot"/recover individ

Future systems from other vendors, based on th
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against h
as PCI cards dying from heat, humidity, dust,
electrical connections. The vast majority of E
"real life" are due to either poorly seated PC
unfortunately quite commonly, due to device dr
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that caus
attempt to DMA to a location in system memory
reserved for DMA access for that card.  This i
as it prevents what would otherwise have bee
corruption caused by the bad DMA.  A number of
bugs have been found and fixed in this way ove
years.
Other possible causes of EEH errors in
address line parity errors (for example, due t
connectivity due to a poorly seated card), and
errors (due to software, device firmware, or d
The vast majority of "true hardware failures"
physically removing and re-seating the PCI car


Detection and Recovery
----------------------
In the following discussion, a generic overvie
and recover from EEH errors will be presented.
by an overview of how the current implementati
kernel does it.  The actual implementation is
and some of the finer points are still being d
may in turn be swayed if or when other archite
similar functionality.

When a PCI Host Bridge (PHB, the bus controlle
PCI bus to the system CPU electronics complex)
condition, it will "isolate" the affected PCI
will block all writes (either to the card from
from the card to the system), and it will caus
return all-ff's (0xff, 0xffff, 0xffffffff for
This value was chosen because it is the same v
get if the device was physically unplugged fro
This includes access to PCI memory, I/O space,
space.  Interrupts, however, will continue to

Detection and recovery are performed with the
firmware.  The programming interfaces in the L
into the firmware are referred to as RTAS (Run
Services).  The Linux kernel does not (should
the EEH function in the PCI chipsets directly,
there are a number of different chipsets out t
different interfaces and quirks. The firmware
uniform abstraction layer that will work with
and iSeries hardware (and be forwards-compatib

If the OS or device driver suspects that a PCI
EEH-isolated, there is a firmware call it can
this is the case. If so, then the device drive
into a consistent state (given that it won't b
pending work) and start recovery of the card.
would consist of resetting the PCI device (hol
line high for two seconds), followed by settin
config space (the base address registers (BAR'
cache line size, interrupt line, and so on).
reinitialization of the device driver.  In a w
the power to the card can be toggled, at least
slots.  In principle, layers far above the dev
do not need to know that the PCI card has been
way; ideally, there should be at most a pause
I/O while the card is being reset.

If the card cannot be recovered after three or
kernel/device driver should assume the worst-c
card has died completely, and report this erro
In addition, error messages are reported throu
syslogd (/var/log/messages) to alert the sysad
The correct way to deal with failed adapters i
PCI hotplug tools to remove and replace the de


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism
so that individual device drivers do not need
EEH recovery.  This generic mechanism piggy-ba
infrastructure,  and percolates events up thro
infrastructure.  Following is a detailed descr
accomplished.

EEH must be enabled in the PHBs very early du
and if a PCI slot is hot-plugged. The former i
eeh_init() in arch/powerpc/platforms/pseries/e
drivers/pci/hotplug/pSeries_pci.c calling in t
EEH must be enabled before a PCI scan of the d
Current Power5 hardware will not work unless E
although older Power4 can run with it disabled
EEH can no longer be turned off.  PCI devices
registered with the EEH code; the EEH code nee
the I/O address ranges of the PCI device in or
error.  Given an arbitrary address, the routin
pci_get_device_by_addr() will find the PCI dev
with that address (if any).

The default arch/powerpc/include/asm/io.h macr
etc.
include a check to see if the I/O read re
If so, these make a call to eeh_dn_check_failu
asks the firmware if the all-ff's value is the
error.  If it is not, processing continues as
total number of these false alarms or "false p
seen in /proc/ppc64/eeh (subject to change).
all of these occur during boot, when the PCI b
a large number of 0xff reads are part of the b

If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will prin
syslog (/var/log/messages).  This stack trace
useful to device-driver authors for finding ou
error was detected, as the error itself usuall
beforehand.

Next, it uses the Linux kernel notifier chain/
allow any interested parties to find out about
drivers, or other parts of the kernel, can use
`eeh_register_notifier(struct notifier_block *
events.  The event will include a pointer to t
device node and some state info.  Receivers of
they wish"; the default handler will be descri
section.

To assist in the recovery of the device, eeh.c
following functions:

rtas_set_slot_reset()
   assert the PCI #RST line for 1/8th of a se
rtas_configure_bridge()
   ask firmware to configure any PCI bridges
   located topologically under the PCI slot.
eeh_save_bars() and eeh_restore_bars():
   save and restore the PCI
   config-space info for a device and any devi


A handler for the EEH notifier_block events is
drivers/pci/hotplug/pSeries_pci.c, called hand
It saves the device BARs and then calls rpaph
This last call causes the device driver for th
which causes uevents to go out to user space.
user-space scripts that might issue commands s
for ethernet cards, and so on.  This handler t
hoping to give the user-space scripts enough t
It then resets the PCI card, reconfigures the
any bridges underneath.
It then calls rpaphp_e
which restarts the device driver and triggers
events (for example, calling "ifup eth0" for e


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a pci
focusing on how the device driver gets shut do
events get delivered to user-space scripts.

Following is an example sequence of events tha
close function to be called during the first p
The following sequence is an example of the pc

    rpa_php_unconfig_pci_adapter (struct slot
    {
      calls
      pci_remove_bus_device (struct pci_dev *)
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drive
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove
                  pci_device_remove() // in /
                  {
                    calls
                    struct pci_driver->remove(
                    pcnet32_remove_one() // in
                    {
                      calls
                      unregister_netdev() // i
                      {
                        calls
                        dev_close() // in /ne
                        {
                          calls dev->stop();
                          which is just pcnet
                          {
                            which does what y
                            to stop the devic
                          }
                        }
                      }
                      which
                      frees pcnet32 device driver
                    }
    }}}}}}


in drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_dev
which calls struct pci_driver->remove() which
which calls unregister_netdev() (in net/core/
which calls dev_close() (in net/core/dev.c)
which calls dev->stop() which is pcnet32_close
which then does the appropriate shutdown.
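
As a rough illustration of the callback chain traced above, the following
userspace C sketch models how a bus-level remove call can reach a network
driver's stop routine through function pointers. It is not kernel code:
every toy_* type and function is a hypothetical stand-in invented for this
example, and the real kernel structures carry far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct toy_netdev {
	int running;                          /* nonzero while the interface is up */
	int (*stop)(struct toy_netdev *dev);  /* plays the role of dev->stop() */
};

struct toy_pci_dev;

struct toy_pci_driver {
	void (*remove)(struct toy_pci_dev *pdev);  /* like pci_driver->remove() */
};

struct toy_pci_dev {
	struct toy_pci_driver *driver;
	struct toy_netdev *netdev;
};

/* Records the order in which the steps of the chain run. */
#define TOY_TRACE_MAX 8
static const char *toy_trace[TOY_TRACE_MAX];
static size_t toy_trace_len;

static void toy_record(const char *step)
{
	if (toy_trace_len < TOY_TRACE_MAX)
		toy_trace[toy_trace_len++] = step;
}

/* Driver-specific shutdown, the analogue of pcnet32_close(). */
int toy_nic_close(struct toy_netdev *dev)
{
	toy_record("stop");
	dev->running = 0;
	return 0;
}

/* Analogue of dev_close(): stop the interface only if it is up. */
void toy_dev_close(struct toy_netdev *dev)
{
	toy_record("dev_close");
	if (dev->running && dev->stop)
		dev->stop(dev);
}

/* Analogue of unregister_netdev(): closing is part of unregistering. */
void toy_unregister_netdev(struct toy_netdev *dev)
{
	toy_record("unregister_netdev");
	toy_dev_close(dev);
}

/* Analogue of pcnet32_remove_one(): the driver's remove callback. */
void toy_nic_remove_one(struct toy_pci_dev *pdev)
{
	toy_record("driver_remove");
	toy_unregister_netdev(pdev->netdev);
	pdev->netdev = NULL;
}

/* Analogue of pci_device_remove(): dispatch to the bound driver. */
void toy_pci_device_remove(struct toy_pci_dev *pdev)
{
	toy_record("pci_device_remove");
	if (pdev->driver && pdev->driver->remove)
		pdev->driver->remove(pdev);
	pdev->driver = NULL;
}

/* Builds one device, removes it, and returns 0 if the chain ran in order. */
int toy_demo(void)
{
	struct toy_netdev nd = { 1, toy_nic_close };
	struct toy_pci_driver drv = { toy_nic_remove_one };
	struct toy_pci_dev pdev = { &drv, &nd };

	toy_trace_len = 0;
	toy_pci_device_remove(&pdev);

	if (nd.running != 0 || pdev.driver != NULL || toy_trace_len != 5)
		return -1;
	if (strcmp(toy_trace[0], "pci_device_remove") != 0)
		return -1;
	if (strcmp(toy_trace[4], "stop") != 0)
		return -1;
	return 0;
}
```

Calling toy_demo() from a main() function exercises the whole chain. The
point of the sketch is only that each layer knows nothing about the layer
below it except a function pointer, which is why removing the PCI device
is enough to shut the network interface down.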

---

Following is the analogous stack trace for eve
when the pci device is unconfigured::

  rpa_php_unconfig_pci_adapter() {
    calls
    pci_remove_bus_device (struct pci_dev *) {
      calls
      pci_destroy_dev (struct pci_dev *) {
        calls
        device_unregister (&dev->dev) {
          calls
          device_del(struct device * dev) {
            calls
            kobject_del() {
              calls
              kobject_uevent() {
                calls
                kset_uevent() {
                  calls
                  kset->uevent_ops->uevent()
                  a call to
                  dev_uevent() {
                    calls
                    dev->bus->uevent() which i
                    pci_uevent () {
                      which prints device name
                    }
                  }
                  then kobject_uevent() sends a
                  --> userspace uevent
                  (during early boot, nobody li
                  kobject_uevent() executes uev
                  event process /sbin/hotplug)
                }
              }
              kobject_del() then calls sysfs_remo
              trigger any user-space daemon that
              and notice the delete event.


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH
which may be addressed in future revisions.  B
big plus of the current design is that no chan
individual device drivers, so that the current
The biggest negative of the design is that it
network daemons and file systems that didn't n

- A minor complaint is that resetting the net
  user-space back-to-back ifdown/ifup burps t
  network daemons, that didn't need to even k
  card was being rebooted.

- A more serious concern is that the same res
  causes havoc to mounted file systems.  Scri
  unmount a file system without flushing pend
  is impossible, because I/O has already been
  ideally, the reset should happen at or belo
  so that the file systems are not disturbed.

  Reiserfs does not tolerate errors returned
  Ext3fs seems to be tolerant, retrying reads
  succeed.
  Both have been only lightly tested

  The SCSI-generic subsystem already has buil
  SCSI device resets, SCSI bus resets, and SC
  (HBA) resets.  These are cascaded into a ch
  resets if a SCSI command fails. These are c
  from the block layer.  It would be very nat
  reset into this chain of events.

- If a SCSI error occurs for the root device,
  the sysadmin had the foresight to run /bin,
  and so on, out of ramdisk/tmpfs.


Conclusions
-----------
There's forward progress ...