.. SPDX-License-Identifier: GPL-2.0

===========================
The Spidernet Device Driver
===========================

Written by Linas Vepstas <linas@austin.ibm.com>

Version of 7 June 2007

Abstract
========
This document sketches the structure of portions of the spidernet
device driver in the Linux kernel tree. The spidernet is a gigabit
ethernet device built into the Toshiba southbridge commonly used
in the SONY Playstation 3 and the IBM QS20 Cell blade.

The Structure of the RX Ring.
=============================
The receive (RX) ring is a circular linked list of RX descriptors,
together with three pointers into the ring that are used to manage its
contents.

The elements of the ring are called "descriptors" or "descrs"; they
describe the received data. This includes a pointer to a buffer
containing the received data, the buffer size, and various status bits.

There are three primary states that a descriptor can be in: "empty",
"full" and "not-in-use". An "empty" or "ready" descriptor is ready
to receive data from the hardware. A "full" descriptor has data in it,
and is waiting to be emptied and processed by the OS. A "not-in-use"
descriptor is neither empty nor full; it is simply not ready. It may
not even have a data buffer in it, or is otherwise unusable.

During normal operation, on device startup, the OS (specifically, the
spidernet device driver) allocates a set of RX descriptors and RX
buffers. These are all marked "empty", ready to receive data. This
ring is handed off to the hardware, which sequentially fills in the
buffers, and marks them "full". The OS follows up, taking the full
buffers, processing them, and re-marking them empty.

This filling and emptying is managed by three pointers: the "head"
and "tail" pointers, managed by the OS, and a hardware current
descriptor pointer (GDACTDPA).
The GDACTDPA points at the descr
currently being filled. When this descr is filled, the hardware
marks it full, and advances the GDACTDPA by one. Thus, when there is
flowing RX traffic, every descr behind it should be marked "full",
and everything in front of it should be "empty". If the hardware
discovers that the current descr is not empty, it will signal an
interrupt, and halt processing.

The tail pointer tails or trails the hardware pointer. When the
hardware is ahead, the tail pointer will be pointing at a "full"
descr. The OS will process this descr, and then mark it "not-in-use",
and advance the tail pointer. Thus, when there is flowing RX traffic,
all of the descrs in front of the tail pointer should be "full", and
all of those behind it should be "not-in-use". When RX traffic is not
flowing, then the tail pointer can catch up to the hardware pointer.
The OS will then note that the current tail is "empty", and halt
processing.

The head pointer (somewhat mis-named) follows after the tail pointer.
When traffic is flowing, then the head pointer will be pointing at
a "not-in-use" descr. The OS will perform various housekeeping duties
on this descr. This includes allocating a new data buffer and
dma-mapping it so as to make it visible to the hardware. The OS will
then mark the descr as "empty", ready to receive data. Thus, when there
is flowing RX traffic, everything in front of the head pointer should
be "not-in-use", and everything behind it should be "empty". If no
RX traffic is flowing, then the head pointer can catch up to the tail
pointer, at which point the OS will notice that the head descr is
"empty", and it will halt processing.

Thus, in an idle system, the GDACTDPA, tail and head pointers will
all be pointing at the same descr, which should be "empty". All of the
other descrs in the ring should be "empty" as well.
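
The interplay of the three pointers can be illustrated with a small
user-space model. This is only a sketch, not the driver's code: the
names ``rx_ring``, ``hw_receive()``, ``os_process_tail()`` and
``os_refill_head()`` and the ring size of 8 are assumptions made for
illustration, and buffer allocation and DMA mapping are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the RX ring; every name here is hypothetical. */
enum rx_state { RX_EMPTY, RX_FULL, RX_NOT_IN_USE };

#define RING_SIZE 8

struct rx_ring {
    enum rx_state descr[RING_SIZE];
    size_t hw;    /* models the hardware pointer (GDACTDPA) */
    size_t tail;  /* next descr the OS will process */
    size_t head;  /* next descr the OS will refill */
};

/* Device startup: every descr is "empty", all pointers coincide. */
static void ring_init(struct rx_ring *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < RING_SIZE; i++)
        r->descr[i] = RX_EMPTY;
    r->hw = r->tail = r->head = 0;
}

/*
 * Hardware side: fill the current descr and advance.
 * Returns 1 on success, 0 if the current descr is not empty
 * (the real hardware would raise an interrupt and halt).
 */
static int hw_receive(struct rx_ring *r)
{
    if (r->descr[r->hw] != RX_EMPTY)
        return 0;
    r->descr[r->hw] = RX_FULL;
    r->hw = (r->hw + 1) % RING_SIZE;
    return 1;
}

/* OS side: consume "full" descrs at the tail, mark them "not-in-use". */
static int os_process_tail(struct rx_ring *r)
{
    int n = 0;
    while (r->descr[r->tail] == RX_FULL) {
        r->descr[r->tail] = RX_NOT_IN_USE;
        r->tail = (r->tail + 1) % RING_SIZE;
        n++;
    }
    return n;
}

/* OS side: re-arm "not-in-use" descrs at the head as "empty". */
static int os_refill_head(struct rx_ring *r)
{
    int n = 0;
    while (r->descr[r->head] == RX_NOT_IN_USE) {
        r->descr[r->head] = RX_EMPTY;
        r->head = (r->head + 1) % RING_SIZE;
        n++;
    }
    return n;
}
```

In this model, three calls to ``hw_receive()`` followed by
``os_process_tail()`` and ``os_refill_head()`` return the ring to the
idle condition, with all three pointers on the same "empty" descr.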

The show_rx_chain() routine will print out the locations of the
GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.

A typical example of the output, for a nearly idle system, might be::

    net eth1: Total number of descrs=256
    net eth1: Chain tail located at descr=20
    net eth1: Chain head is at 20
    net eth1: HW curr desc (GDACTDPA) is at 21
    net eth1: Have 1 descrs with stat=x40800101
    net eth1: HW next desc (GDACNEXTDA) is at 22
    net eth1: Last 255 descrs with stat=xa0800000

In the above, the hardware has filled in one descr, number 20. Both
head and tail are pointing at 20, because it has not yet been emptied.
Meanwhile, hw is pointing at 21, which is free.

The "Have nnn descrs" refers to the descr starting at the tail: in this
case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
to all of the rest of the descrs, from the last status change. The "nnn"
is a count of how many descrs have exactly the same status.

The status x4... corresponds to "full" and status xa... corresponds
to "empty". The actual value printed is RXCOMST_A.

In the device driver source code, a different set of names is
used for these same concepts, so that::

    "empty"      == SPIDER_NET_DESCR_CARDOWNED   == 0xa
    "full"       == SPIDER_NET_DESCR_FRAME_END   == 0x4
    "not in use" == SPIDER_NET_DESCR_NOT_IN_USE  == 0xf


The RX RAM full bug/feature
===========================

As long as the OS can empty out the RX buffers at a rate faster than
the hardware can fill them, there is no problem. If, for some reason,
the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
pointer will catch up to the head, notice the not-empty condition,
and stop. However, RX packets may still continue arriving on the wire.
The spidernet chip can save some limited number of these in local RAM.
When this local RAM fills up, the spider chip will issue an interrupt
indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
will be set in GHIINT1STS). When the RX RAM full condition occurs,
a certain bug/feature is triggered that has to be specially handled.
This section describes the special handling for this condition.

When the OS finally has a chance to run, it will empty out the RX ring.
In particular, it will clear the descriptor on which the hardware had
stopped. However, once the hardware has decided that a certain
descriptor is invalid, it will not restart at that descriptor; instead
it will restart at the next descr. This potentially will lead to a
deadlock condition, as the tail pointer will be pointing at this descr,
which, from the OS point of view, is empty; the OS will be waiting for
this descr to be filled. However, the hardware has skipped this descr,
and is filling the next descrs. Since the OS doesn't see this, there
is a potential deadlock, with the OS waiting for one descr to fill,
while the hardware is waiting for a different set of descrs to become
empty.

A call to show_rx_chain() at this point indicates the nature of the
problem. A typical print when the network is hung shows the following::

    net eth1: Spider RX RAM full, incoming packets might be discarded!
    net eth1: Total number of descrs=256
    net eth1: Chain tail located at descr=255
    net eth1: Chain head is at 255
    net eth1: HW curr desc (GDACTDPA) is at 0
    net eth1: Have 1 descrs with stat=xa0800000
    net eth1: HW next desc (GDACNEXTDA) is at 1
    net eth1: Have 127 descrs with stat=x40800101
    net eth1: Have 1 descrs with stat=x40800001
    net eth1: Have 126 descrs with stat=x40800101
    net eth1: Last 1 descrs with stat=xa0800000

Both the tail and head pointers are pointing at descr 255, which is
marked xa... which is "empty". Thus, from the OS point of view, there
is nothing to be done. In particular, there is the implicit assumption
that everything in front of the "empty" descr must surely also be empty,
as explained in the last section. The OS is waiting for descr 255 to
become non-empty, which, in this case, will never happen.

The HW pointer is at descr 0. This descr is marked 0x4.. or "full".
Since it's already full, the hardware can do nothing more, and thus has
halted processing. Notice that descrs 0 through 253 are all marked
"full" (127 + 1 + 126 = 254 descrs), while descrs 254 and 255 are
empty. (The "Last 1 descrs" is descr 254, since tail was at 255.)
Thus, the system is deadlocked, and there can be no forward progress;
the OS thinks there's nothing to do, and the hardware has nowhere to
put incoming data.

This bug/feature is worked around with the spider_net_resync_head_ptr()
routine. When the driver receives RX interrupts, but an examination
of the RX chain seems to show it is empty, then it is probable that
the hardware has skipped a descr or two (sometimes dozens under heavy
network conditions). The spider_net_resync_head_ptr() subroutine will
search the ring for the next full descr, and the driver will resume
operations there. Since this will leave "holes" in the ring, there
is also a spider_net_resync_tail_ptr() that will skip over such holes.
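
The search for the next full descr can be sketched as a bounded scan
around the ring. The following is a simplified user-space illustration,
not the body of spider_net_resync_head_ptr(): the function name
``find_next_full()`` and the flat array of status nibbles are
assumptions, and locking and the real descriptor layout are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Status nibbles as listed earlier in this document. */
enum { ST_EMPTY = 0xa, ST_FULL = 0x4, ST_NOT_IN_USE = 0xf };

/*
 * Hypothetical helper: starting at 'start', walk the ring at most
 * once and return the index of the first "full" descr, or -1 if
 * the ring holds no full descr at all.
 */
static long find_next_full(const unsigned char *status, size_t n,
                           size_t start)
{
    assert(status != NULL && n > 0 && start < n);
    for (size_t i = 0; i < n; i++) {
        size_t idx = (start + i) % n;  /* wrap around the ring */
        if (status[idx] == ST_FULL)
            return (long)idx;
    }
    return -1;
}
```

Applied to a scaled-down version of the hung ring shown above (tail on
an empty descr at the last index, full descrs starting at index 0), the
scan skips the empty descr and lands on index 0, which is where the
hardware had resumed filling.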

As of this writing, the spider_net_resync() strategy seems to work very
well, even under heavy network loads.


The TX ring
===========
The TX ring uses a low-watermark interrupt scheme to make sure that
the TX queue is appropriately serviced for large packet sizes.

For packet sizes greater than about 1 KByte, the kernel can fill
the TX ring quicker than the device can drain it. Once the ring
is full, the netdev is stopped. When there is room in the ring,
the netdev needs to be reawakened, so that more TX packets are placed
in the ring. The hardware can empty the ring about four times per jiffy,
so it's not appropriate to wait for the poll routine to refill, since
the poll routine runs only once per jiffy. The low-watermark mechanism
marks a descr about one quarter of the way from the bottom of the queue,
so that an interrupt is generated when the descr is processed. This
interrupt wakes up the netdev, which can then refill the queue.
For large packets, this mechanism generates a relatively small number
of interrupts, about 1K/sec. For smaller packets, this will drop to zero
interrupts, as the hardware can empty the queue faster than the kernel
can fill it.
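
The choice of the low-watermark descr can be sketched as simple index
arithmetic. This is an illustration under stated assumptions rather
than the driver's routine: ``low_watermark_index()`` is a hypothetical
name, "bottom of the queue" is taken to mean the end that drains last,
and the cutoff of four queued descrs is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the tail index (next descr the hardware
 * will transmit), the number of queued descrs and the ring size,
 * return the index of the descr to flag for a low-watermark
 * interrupt, or -1 if the queue is too short to bother.
 */
static long low_watermark_index(size_t tail, size_t queued,
                                size_t ring_size)
{
    assert(ring_size > 0 && tail < ring_size && queued <= ring_size);

    /* Assumption: a very short queue needs no interrupt. */
    if (queued < 4)
        return -1;

    /* Three quarters in from the tail is one quarter from the bottom. */
    size_t offset = (queued * 3) / 4;
    return (long)((tail + offset) % ring_size);
}
```

With a full ring of 256 descrs and the tail at 0, this flags descr 192,
so the interrupt would fire once roughly three quarters of the queue has
drained, leaving the kernel time to refill before the ring runs empty.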