~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/driver-api/nvdimm/btt.rst

Version: ~ [ linux-6.12-rc7 ] ~ [ linux-6.11.7 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.60 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.116 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.171 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.229 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.285 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.323 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.12 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

Diff markup

Differences between /Documentation/driver-api/nvdimm/btt.rst (Version linux-6.12-rc7) and /Documentation/driver-api/nvdimm/btt.rst (Version linux-4.17.19)


  1 =============================                     
  2 BTT - Block Translation Table                     
  3 =============================                     
  4                                                   
  5                                                   
  6 1. Introduction                                   
  7 ===============                                   
  8                                                   
  9 Persistent memory based storage is able to per    
 10 accurately, cache line) granularity. However,     
 11 storage as traditional block devices. The bloc    
 12 will do exactly this. However, they do not pro    
 13 Traditional SSDs typically provide protection     
 14 using stored energy in capacitors to complete     
 15 in firmware. We don't have this luxury with pe    
 16 progress, and we experience a power failure, t    
 17 and new data. Applications may not be prepared    
 18                                                   
 19 The Block Translation Table (BTT) provides ato    
 20 persistent memory devices, so that application    
 21 being torn can continue to do so. The BTT mani    
 22 device, and reserves a portion of the underlyi    
 23 the heart of it, is an indirection table that     
 24 volume. It can be thought of as an extremely s    
 25 provides atomic sector updates.                   
 26                                                   
 27                                                   
 28 2. Static Layout                                  
 29 ================                                  
 30                                                   
 31 The underlying storage on which a BTT can be l    
 32 The BTT, however, splits the available space i    
 33 called "Arenas".                                  
 34                                                   
 35 Each arena follows the same layout for its met    
 36 arena are internal to it (with the exception o    
 37 next arena). The following depicts the "On-dis    
 38                                                   
 39                                                   
 40     Backing Store     +------->  Arena            
 41   +---------------+   |   +------------------+    
 42   |               |   |   | Arena info block |    
 43   |    Arena 0    +---+   |       4K         |    
 44   |     512G      |       +------------------+    
 45   |               |       |                  |    
 46   +---------------+       |                  |    
 47   |               |       |                  |    
 48   |    Arena 1    |       |   Data Blocks    |    
 49   |     512G      |       |                  |    
 50   |               |       |                  |    
 51   +---------------+       |                  |    
 52   |       .       |       |                  |    
 53   |       .       |       |                  |    
 54   |       .       |       |                  |    
 55   |               |       |                  |    
 56   |               |       |                  |    
 57   +---------------+       +------------------+    
 58                           |                  |    
 59                           |     BTT Map      |    
 60                           |                  |    
 61                           |                  |    
 62                           +------------------+    
 63                           |                  |    
 64                           |     BTT Flog     |    
 65                           |                  |    
 66                           +------------------+    
 67                           | Info block copy  |    
 68                           |       4K         |    
 69                           +------------------+    
 70                                                   
 71                                                   
 72 3. Theory of Operation                            
 73 ======================                            
 74                                                   
 75                                                   
 76 a. The BTT Map                                    
 77 --------------                                    
 78                                                   
 79 The map is a simple lookup/indirection table t    
 80 block. Each map entry is 32 bits. The two most    
 81 flags, and the remaining form the internal blo    
 82                                                   
 83 ======== =====================================    
 84 Bit      Description                              
 85 ======== =====================================    
 86 31 - 30  Error and Zero flags - Used in the fo    
 87                                                   
 88            == ==  ============================    
 89            31 30  Description                     
 90            == ==  ============================    
 91            0  0   Initial state. Reads return     
 92            0  1   Zero state: Reads return zer    
 93            1  0   Error state: Reads fail; Wri    
 94            1  1   Normal Block – has valid p    
 95            == ==  ============================    
 96                                                   
 97 29 - 0   Mappings to internal 'postmap' blocks    
 98 ======== =====================================    
 99                                                   
100                                                   
101 Some of the terminology that will be subsequen    
102                                                   
103 ============    ==============================    
104 External LBA    LBA as made visible to upper l    
105 ABA             Arena Block Address - Block of    
106 Premap ABA      The block offset into an arena    
107                 checking the External LBA         
108 Postmap ABA     The block number in the "Data     
109                 indirection from the map          
110 nfree           The number of free blocks that    
111                 This is the number of concurre    
112                 arena.                            
113 ============    ==============================    
114                                                   
115                                                   
116 For example, after adding a BTT, we surface a     
117 the external LBA at 768G. This falls into the     
118 worth of blocks that this arena contributes, t    
119 premap ABA is 256G. We now refer to the map, a    
120 'X' (256G) points to block 'Y', say '64'. Thus    
121                                                   
122                                                   
123 b. The BTT Flog                                   
124 ---------------                                   
125                                                   
126 The BTT provides sector atomicity by making ev    
127 i.e. Every write goes to a "free" block. A run    
128 maintained in the form of the BTT flog. 'Flog'    
129 "free list" and "log". The flog contains 'nfre    
130                                                   
131 ========  ====================================    
132 lba       The premap ABA that is being written    
133 old_map   The old postmap ABA - after 'this' w    
134           free block.                             
135 new_map   The new postmap ABA. The map will up    
136           lba->postmap_aba mapping, but we log    
137           recover.                                
138 seq       Sequence number to mark which of the    
139           valid/newest. It cycles between 01->    
140           operation, with 00 indicating an uni    
141 lba'      alternate lba entry                     
142 old_map'  alternate old postmap entry             
143 new_map'  alternate new postmap entry             
144 seq'      alternate sequence number.              
145 ========  ====================================    
146                                                   
147 Each of the above fields is 32-bit, making one    
148 padded to 64 bytes to avoid cache line sharing    
149 done such that for any entry being written, it    
150 a. overwrites the 'old' section in the entry b    
151 b. writes the 'new' section such that the sequ    
152                                                   
153                                                   
154 c. The concept of lanes                           
155 -----------------------                           
156                                                   
157 While 'nfree' describes the number of concurre    
158 concurrently, 'nlanes' is the number of IOs th    
159 process::                                         
160                                                   
161         nlanes = min(nfree, num_cpus)             
162                                                   
163 A lane number is obtained at the start of any     
164 all the on-disk and in-memory data structures     
165 there are more CPUs than the max number of ava    
166 protected by spinlocks.                           
167                                                   
168                                                   
169 d. In-memory data structure: Read Tracking Tab    
170 ----------------------------------------------    
171                                                   
172 Consider a case where we have two threads, one    
173 writes. We can hit a condition where the write    
174 a new IO, but the (slow) reader thread is stil    
175 the reader consulted a map entry, and started     
176 writer started writing to the same external LB    
177 the map for that external LBA to point to its     
178 internal, postmap block that the reader is (st    
179 into the list of free blocks. If another write    
180 grab this free block, and start writing to it,    
181 incorrect data. To prevent this, we introduce     
182                                                   
183 The RTT is a simple, per arena table with 'nfr    
184 into rtt[lane_number], the postmap ABA it is r    
185 read is complete. Every writer thread, after g    
186 RTT for its presence. If the postmap free bloc    
187 reader clears the RTT entry, and only then sta    
188                                                   
189                                                   
190 e. In-memory data structure: map locks            
191 --------------------------------------            
192                                                   
193 Consider a case where two writer threads are w    
194 be a race in the following sequence of steps::    
195                                                   
196         free[lane] = map[premap_aba]              
197         map[premap_aba] = postmap_aba             
198                                                   
199 Both threads can update their respective free[    
200 postmap_aba. This has made the layout inconsis    
201 at the same time, duplicating another free ent    
202                                                   
203 To solve this, we could have a single map lock    
204 before performing the above sequence, but we f    
205 Instead we use an array of (nfree) map_locks t    
206 (premap_aba modulo nfree).                        
207                                                   
208                                                   
209 f. Reconstruction from the Flog                   
210 -------------------------------                   
211                                                   
212 On startup, we analyze the BTT flog to create     
213 through all the entries, and for each lane, of    
214 'sections', we always look at the most recent     
215 number). The reconstruction rules/steps are si    
216                                                   
217 - Read map[log_entry.lba].                        
218 - If log_entry.new matches the map entry, then    
219 - If log_entry.new does not match the map entr    
220   (This case can only be caused by power-fails    
221                                                   
222                                                   
223 g. Summarizing - Read and Write flows             
224 -------------------------------------             
225                                                   
226 Read:                                             
227                                                   
228 1.  Convert external LBA to arena number + pre    
229 2.  Get a lane (and take lane_lock)               
230 3.  Read map to get the entry for this pre-map    
231 4.  Enter post-map ABA into RTT[lane]             
232 5.  If TRIM flag set in map, return zeroes, an    
233 6.  If ERROR flag set in map, end IO with EIO     
234 7.  Read data from this block                     
235 8.  Remove post-map ABA entry from RTT[lane]      
236 9.  Release lane (and lane_lock)                  
237                                                   
238 Write:                                            
239                                                   
240 1.  Convert external LBA to Arena number + pre    
241 2.  Get a lane (and take lane_lock)               
242 3.  Use lane to index into in-memory free list    
243     index, next sequence number                   
244 4.  Scan the RTT to check if free block is pre    
245 5.  Write data to this free block                 
246 6.  Read map to get the existing post-map ABA     
247 7.  Write flog entry: [premap_aba / old postma    
248 8.  Write new post-map ABA into map.              
249 9.  Write old post-map entry into the free lis    
250 10. Calculate next sequence number and write i    
251 11. Release lane (and lane_lock)                  
252                                                   
253                                                   
254 4. Error Handling                                 
255 =================                                 
256                                                   
257 An arena would be in an error state if any of     
258 irrecoverably, either due to a bug or a media     
259 indicate an error:                                
260                                                   
261 - Info block checksum does not match (and reco    
262 - All internal available blocks are not unique    
263   sum of mapped blocks and free blocks (from t    
264 - Rebuilding free list from the flog reveals m    
265   entries                                         
266 - A map entry is out of bounds                    
267                                                   
268 If any of these error conditions are encounter    
269 only state using a flag in the info block.        
270                                                   
271                                                   
272 5. Usage                                          
273 ========                                          
274                                                   
275 The BTT can be set up on any disk (namespace)     
276 (pmem, or blk mode). The easiest way to set up    
277 'ndctl' utility [1]:                              
278                                                   
279 For example, the ndctl command line to setup a    
280                                                   
281     ndctl create-namespace -f -e namespace0.0     
282                                                   
283 See ndctl create-namespace --help for more opt    
284                                                   
285 [1]: https://github.com/pmem/ndctl                
                                                      

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php