==========
MD Cluster
==========

The cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).


1. On-disk format
=================

Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::

  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |
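
The byte offsets implied by this diagram can be illustrated with a small
userspace sketch.  It assumes the 4k chunk size shown above and that each
bitmap slot occupies two consecutive chunks (superblock plus bits, then the
continued bits); the real geometry is computed by the md bitmap code::

  #include <stdio.h>
  #include <stdint.h>

  #define CHUNK 4096ULL   /* 4k blocks, as drawn in the layout diagram */

  /* Chunk 0 is "idle" and chunk 1 holds the md superblock; bitmap slot N
   * then occupies two chunks (super + bits, bits contd.) starting here. */
  static uint64_t bitmap_slot_offset(unsigned int slot)
  {
          return (2 + 2 * (uint64_t)slot) * CHUNK;
  }

  int main(void)
  {
          unsigned int slot;

          for (slot = 0; slot < 4; slot++)
                  printf("bitmap slot %u starts at %lluk\n", slot,
                         (unsigned long long)(bitmap_slot_offset(slot) / 1024));
          return 0;
  }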

During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will:

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management
===========================

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)
-------------------------------------

 The bm_lockres protects individual node bitmaps. They are named in
 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
 node joins the cluster, it acquires the lock in PW mode and holds it
 for as long as the node is part of the cluster. The lock
 resource number is based on the slot number returned by the DLM
 subsystem. Since DLM starts node count from one and bitmap slots
 start from zero, one is subtracted from the DLM slot number to arrive
 at the bitmap slot number.

 The LVB of the bitmap lock for a particular node records the range
 of sectors that are being re-synced by that node.  No other
 node may write to those sectors.  This is used when a new node
 joins the cluster.
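
 A minimal userspace sketch of this slot-number mapping, with the
 "bitmap%03d" format inferred from the names quoted above (the md-cluster
 module builds the real name)::

   #include <stdio.h>

   /* The DLM slot number (starting at one) is decremented to get the
    * bitmap slot number (starting at zero) used in the resource name. */
   static void bitmap_lockres_name(unsigned int dlm_slot, char *buf, size_t len)
   {
           snprintf(buf, len, "bitmap%03u", dlm_slot - 1);
   }

   int main(void)
   {
           char name[16];

           bitmap_lockres_name(1, name, sizeof(name));  /* node 1 -> "bitmap000" */
           printf("%s\n", name);
           bitmap_lockres_name(2, name, sizeof(name));  /* node 2 -> "bitmap001" */
           printf("%s\n", name);
           return 0;
   }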

2.2 Message passing locks
-------------------------

 Each node has to communicate with other nodes when it starts or ends a
 resync, and for metadata superblock updates.  This communication is
 managed through three locks: "token", "message" and "ack", together
 with the Lock Value Block (LVB) of the "message" lock resource.

2.3 new-device management
-------------------------

 A single lock: "no-new-dev" is used to coordinate the addition of
 new devices - this must be synchronized across the cluster.
 Normally all nodes hold a concurrent-read lock on this resource.

3. Communication
================

 Messages can be broadcast to all nodes, and the sender waits for all
 other nodes to acknowledge the message before proceeding.  Only one
 message can be processed at a time.

3.1 Message Types
-----------------

 There are six types of messages which are passed:

3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^

   informs other nodes that the metadata has
   been updated, and the node must re-read the md superblock. This is
   performed synchronously. It is primarily used to signal a device
   failure.

3.1.2 RESYNCING
^^^^^^^^^^^^^^^

   informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region.  Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per-node.

3.1.3 NEWDISK
^^^^^^^^^^^^^

   informs other nodes that a device is being added to
   the array. The message contains an identifier for that device.  See
   below for further details.

3.1.4 REMOVE
^^^^^^^^^^^^

   A failed or spare device is being removed from the
   array. The slot-number of the device is included in the message.

3.1.5 RE_ADD
^^^^^^^^^^^^

   A failed device is being re-activated - the assumption
   is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

   If a node is stopped locally but the bitmap
   isn't clean, then another node is informed to take ownership of the
   resync.
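
 For quick reference, the six message types above can be summarised as a
 plain C enum.  The names follow this document; the authoritative
 definition (and its numeric values) lives in drivers/md/md-cluster.c and
 may differ::

   enum msg_type {
           METADATA_UPDATED,       /* superblock changed, re-read it       */
           RESYNCING,              /* a (lo,hi) range is being resynced    */
           NEWDISK,                /* a device identified by uuid is added */
           REMOVE,                 /* a failed/spare device is removed     */
           RE_ADD,                 /* a failed device is re-activated      */
           BITMAP_NEEDS_SYNC,      /* a dirty bitmap needs a new owner     */
   };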

3.2 Communication mechanism
---------------------------

 The DLM LVB is used to communicate between the nodes of the cluster.
 There are three resources used for the purpose:

3.2.1 token
^^^^^^^^^^^

   The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

3.2.2 message
^^^^^^^^^^^^^

   The lock resource which carries the data to communicate.

3.2.3 ack
^^^^^^^^^

   Acquiring this resource means the message has been
   acknowledged by all nodes in the cluster. The BAST of the resource
   is used to inform the receiving nodes that a node wants to
   communicate.

The algorithm is:

 1. receive status - all nodes have a concurrent-read lock on "ack"::

        sender                         receiver                 receiver
        "ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token",
    sender gets EX on "message"::

        sender                        receiver                 receiver
        "token":EX                    "ack":CR                 "ack":CR
        "message":EX
        "ack":CR

    Sender checks that it still needs to send a message. Messages
    received or other events that happened while waiting for the
    "token" may have made this message inappropriate or redundant.

 3. sender writes LVB

    sender down-converts "message" from EX to CW

    sender tries to get EX on "ack"

    ::

      [ wait until all receivers have *processed* the "message" ]

                                       [ triggered by bast of "ack" ]
                                       receiver gets CR on "message"
                                       receiver reads LVB
                                       receiver processes the message
                                       [ wait finish ]
                                       receiver releases "ack"
                                       receiver tries to get PR on "message"

     sender                         receiver                  receiver
     "token":EX                     "message":CR              "message":CR
     "message":CW
     "ack":EX

 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed the message)

    sender down-converts "ack" from EX to CR

    sender releases "message"

    sender releases "token"

    ::

                                 receiver upconverts to PR on "message"
                                 receiver gets CR on "ack"
                                 receiver releases "message"

     sender                      receiver                   receiver
     "ack":CR                    "ack":CR                   "ack":CR
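
The sender's side of steps 2-4 can also be sketched in ordinary userspace
C.  The lock(), downconvert(), unlock() and write_lvb() helpers below are
hypothetical stand-ins for the DLM calls made by drivers/md/md-cluster.c;
they only print what a real sender would do::

  #include <stdbool.h>
  #include <stdio.h>

  enum dlm_mode { CR, CW, PR, EX };

  static const char * const mode_name[] = { "CR", "CW", "PR", "EX" };

  static void lock(const char *res, enum dlm_mode mode)
  {
          printf("lock    %-8s in %s\n", res, mode_name[mode]);
  }

  static void downconvert(const char *res, enum dlm_mode mode)
  {
          printf("convert %-8s to %s\n", res, mode_name[mode]);
  }

  static void unlock(const char *res)
  {
          printf("unlock  %s\n", res);
  }

  static void write_lvb(const char *res, const char *payload)
  {
          printf("write LVB of %s: %s\n", res, payload);
  }

  /* Step 2: events seen while waiting for "token" may have made the
   * message redundant, so a real sender re-checks before writing. */
  static bool still_needed(const char *payload)
  {
          (void)payload;
          return true;
  }

  static void send_message(const char *payload)
  {
          lock("token", EX);              /* step 2: serialise senders      */
          lock("message", EX);

          if (!still_needed(payload)) {
                  unlock("message");
                  unlock("token");
                  return;
          }

          write_lvb("message", payload);  /* step 3: payload travels in LVB */
          downconvert("message", CW);     /* receivers take CR and read it  */
          lock("ack", EX);                /* granted only once every
                                             receiver processed the message */

          downconvert("ack", CR);         /* step 4: back to idle state     */
          unlock("message");
          unlock("token");
  }

  int main(void)
  {
          send_message("RESYNCING lo=0 hi=1024");
          return 0;
  }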


4. Handling Failures
====================

4.1 Node Failure
----------------

 When a node fails, the DLM informs the cluster with the slot
 number. The node starts a cluster recovery thread. The cluster
 recovery thread:

        - acquires the bitmap<number> lock of the failed node
        - opens the bitmap
        - reads the bitmap of the failed node
        - copies the set bits to the local node's bitmap
        - cleans the bitmap of the failed node
        - releases bitmap<number> lock of the failed node
        - initiates resync of the bitmap on the current node.
          md_check_recovery is invoked within recover_bitmaps, and
          md_check_recovery -> metadata_update_start/finish then
          locks the communication by taking lock_comm.
          This means that while one node is resyncing it blocks all
          other nodes from writing anywhere on the array.

 The resync process is the regular md resync. However, in a clustered
 environment, when a resync is performed it needs to tell other nodes
 about the areas which are suspended. Before a resync starts, the node
 sends out RESYNCING with the (lo,hi) range of the area which needs to
 be suspended. Each node maintains a suspend_list, which contains the
 list of ranges which are currently suspended. On receiving RESYNCING,
 the node adds the range to the suspend_list. Similarly, when the node
 performing the resync finishes, it sends RESYNCING with an empty range
 to the other nodes, and the other nodes remove the corresponding entry
 from the suspend_list.

 A helper function, ->area_resyncing(), can be used to check whether a
 particular I/O range should be suspended or not.
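
 The suspend_list handling and the ->area_resyncing() check can be
 modelled with a short userspace sketch.  The data structures below are
 illustrative only - the kernel's own representation differs - but the
 behaviour matches the description above: one range per sending node,
 removed again when an empty range arrives, and consulted before I/O::

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>

   /* One suspended range per sending node. */
   struct suspend_range {
           int slot;                       /* node that sent RESYNCING */
           uint64_t lo, hi;                /* suspended sector range   */
           struct suspend_range *next;
   };

   static struct suspend_range *suspend_list;

   /* Handle a RESYNCING message from 'slot'; an empty range (lo == hi)
    * means that node has finished, so its entry is removed. */
   static void process_resyncing(int slot, uint64_t lo, uint64_t hi)
   {
           struct suspend_range **p = &suspend_list;

           while (*p) {                    /* only one range per node */
                   if ((*p)->slot == slot) {
                           struct suspend_range *old = *p;

                           *p = old->next;
                           free(old);
                   } else {
                           p = &(*p)->next;
                   }
           }

           if (lo < hi) {
                   struct suspend_range *r = malloc(sizeof(*r));

                   if (!r)
                           return;
                   r->slot = slot;
                   r->lo = lo;
                   r->hi = hi;
                   r->next = suspend_list;
                   suspend_list = r;
           }
   }

   /* In the spirit of ->area_resyncing(): does [lo, hi) overlap any
    * range that some node is currently resyncing? */
   static bool area_suspended(uint64_t lo, uint64_t hi)
   {
           struct suspend_range *r;

           for (r = suspend_list; r; r = r->next)
                   if (lo < r->hi && hi > r->lo)
                           return true;
           return false;
   }

   int main(void)
   {
           process_resyncing(1, 0, 4096);               /* node 1 resyncs 0-4095 */
           printf("%d\n", area_suspended(1000, 2000));  /* 1: overlaps           */
           process_resyncing(1, 0, 0);                  /* node 1 finished       */
           printf("%d\n", area_suspended(1000, 2000));  /* 0: nothing suspended  */
           return 0;
   }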

4.2 Device Failure
------------------

 Device failures are handled and communicated with the metadata update
 routine.  When a node detects a device failure it does not allow
 any further writes to that device until the failure has been
 acknowledged by all other nodes.

5. Adding a new Device
======================

 For adding a new device, it is necessary that all nodes "see" the new
 device to be added. For this, the following algorithm is used:

   1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
   2.  Node 1 sends a NEWDISK message with uuid and slot number
   3.  Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
   4.  In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
   5.  Other nodes issue either of the following depending on whether
       the disk was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
       disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
   6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
   7.  Node 1 attempts EX lock on "no-new-dev"
   8.  If node 1 gets the lock, it sends METADATA_UPDATED after
       unmarking the disk as SpareLocal
   9.  If node 1 cannot get the "no-new-dev" lock, it fails the operation
       and sends METADATA_UPDATED.
   10. Other nodes learn whether or not the disk was added from the
       following METADATA_UPDATED.
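
 The decision node 1 makes in steps 7-10 can be sketched as below.
 try_lock_ex() and send_msg() are hypothetical stand-ins for the DLM and
 messaging calls, and the "SpareLocal" unmarking is taken from step 8
 above; this is an illustration, not the kernel's actual code::

   #include <stdbool.h>
   #include <stdio.h>

   static bool try_lock_ex(const char *res)
   {
           printf("try EX lock on \"%s\"\n", res);
           return true;    /* pretend every other node dropped its CR lock */
   }

   static void send_msg(const char *type)
   {
           printf("send %s\n", type);
   }

   /* Steps 7-10 as seen from node 1: METADATA_UPDATED is sent on both
    * paths, and the other nodes learn the outcome from the superblock
    * they then re-read. */
   static int finish_add_new_disk(void)
   {
           if (try_lock_ex("no-new-dev")) {
                   printf("unmark the disk as SpareLocal\n");   /* step 8 */
                   send_msg("METADATA_UPDATED");
                   return 0;
           }

           /* step 9: some node NACKed and still holds CR - fail the add */
           send_msg("METADATA_UPDATED");
           return -1;
   }

   int main(void)
   {
           return finish_add_new_disk() ? 1 : 0;
   }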

6. Module interface
===================

 There are 17 call-backs which the md core can make to the cluster
 module.  Understanding these can give a good overview of the whole
 process.

6.1 join(nodes) and leave()
---------------------------

 These are called when an array is started with a clustered bitmap,
 and when the array is stopped.  join() ensures the cluster is
 available and initializes the various resources.
 Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()
-----------------

 Reports the slot number advised by the cluster infrastructure.
 Range is from 0 to nodes-1.

6.3 resync_info_update()
------------------------

 This updates the resync range that is stored in the bitmap lock.
 The starting point is updated as the resync progresses.  The
 end point is always the end of the array.
 It does *not* send a RESYNCING message.
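
 A hedged sketch of the idea - the (lo,hi) range packed into the lock
 value block of the node's bitmap lock.  The struct layout, the field
 names and the 64-byte LVB size are assumptions made for illustration;
 the real encoding lives in drivers/md/md-cluster.c::

   #include <stdint.h>
   #include <string.h>

   #define LVB_LEN 64      /* assumed LVB size for this sketch */

   /* Hypothetical encoding of the resync range kept in the bitmap
    * lock's LVB (endianness is ignored here). */
   struct resync_range {
           uint64_t lo;    /* advances as the resync progresses */
           uint64_t hi;    /* always the end of the array       */
   };

   static void pack_resync_range(char lvb[LVB_LEN], const struct resync_range *r)
   {
           memset(lvb, 0, LVB_LEN);
           memcpy(lvb, r, sizeof(*r));
   }

   int main(void)
   {
           char lvb[LVB_LEN];
           struct resync_range r = { .lo = 0, .hi = 1048576 };

           pack_resync_range(lvb, &r);
           return 0;
   }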

6.4 resync_start(), resync_finish()
-----------------------------------

 These are called when resync/recovery/reshape starts or finishes.
 They update the resyncing range in the bitmap lock and also
 send a RESYNCING message.  resync_start reports the whole
 array as resyncing, resync_finish reports none of it as resyncing.

 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
 allows some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
--------------------------------------------------------------------------------

 metadata_update_start() is used to get exclusive access to
 the metadata.  If a change is still needed once that access is
 gained, metadata_update_finish() will send a METADATA_UPDATED
 message to all other nodes, otherwise metadata_update_cancel()
 can be used to release the lock.

6.6 area_resyncing()
--------------------

 This combines two elements of functionality.

 Firstly, it will check if any node is currently resyncing
 anything in a given range of sectors.  If any resync is found,
 then the caller will avoid writing or read-balancing in that
 range.

 Secondly, while node recovery is happening it reports that
 all areas are resyncing for READ requests.  This avoids races
 between the cluster-filesystem and the cluster-RAID while
 handling a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------

 These are used to manage the new-disk protocol described above.
 When a new device is added, add_new_disk_start() is called before
 it is bound to the array and, if that succeeds, add_new_disk_finish()
 is called once the device is fully added.

 When a device is added in acknowledgement of a previous
 request, or when the device is declared "unavailable",
 new_disk_ack() is called.

6.8 remove_disk()
-----------------

 This is called when a spare or failed device is removed from
 the array.  It causes a REMOVE message to be sent to the other nodes.

6.9 gather_bitmaps()
--------------------

 This sends a RE_ADD message to all other nodes and then
 gathers bitmap information from all the bitmaps.  This combined
 bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------

 These are called when the bitmap is changed to "none".  If a node
 wants to clear the bitmap of a clustered raid, it needs to make sure
 that no other node is using the raid.  This is achieved by taking all
 the bitmap locks within the cluster, and the corresponding unlock call
 releases them again afterwards.
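
 The intent can be sketched as below, with try_lock() and drop_lock() as
 hypothetical stand-ins for the per-node bitmap lock operations and
 'nodes' being the cluster size passed to join().  Unwinding of the
 already-taken locks on failure is omitted from this sketch::

   #include <stdbool.h>
   #include <stdio.h>

   static bool try_lock(const char *res)
   {
           printf("take %s\n", res);
           return true;            /* fails if another node still holds it */
   }

   static void drop_lock(const char *res)
   {
           printf("drop %s\n", res);
   }

   /* Take every per-node bitmap lock; if one cannot be taken, some
    * node is still using the array. */
   static bool lock_all_bitmaps(unsigned int nodes)
   {
           char name[16];
           unsigned int slot;

           for (slot = 0; slot < nodes; slot++) {
                   snprintf(name, sizeof(name), "bitmap%03u", slot);
                   if (!try_lock(name))
                           return false;
           }
           return true;
   }

   static void unlock_all_bitmaps(unsigned int nodes)
   {
           char name[16];
           unsigned int slot;

           for (slot = 0; slot < nodes; slot++) {
                   snprintf(name, sizeof(name), "bitmap%03u", slot);
                   drop_lock(name);
           }
   }

   int main(void)
   {
           if (lock_all_bitmaps(4)) {
                   printf("safe to switch the bitmap to none\n");
                   unlock_all_bitmaps(4);
           }
           return 0;
   }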

7. Unsupported features
=======================

There are some things which are not supported by cluster MD yet.

- change array_sectors.