TOMOYO Linux Cross Reference
Linux/Documentation/filesystems/orangefs.rst

Differences between /Documentation/filesystems/orangefs.rst (Version linux-6.12-rc7) and /Documentation/filesystems/orangefs.rst (Version linux-4.10.17)


  1 .. SPDX-License-Identifier: GPL-2.0               
  2                                                   
  3 ========                                          
  4 ORANGEFS                                          
  5 ========                                          
  6                                                   
  7 OrangeFS is an LGPL userspace scale-out parall    
  8 for large storage problems faced by HPC, BigDa    
  9 Genomics, Bioinformatics.                         
 10                                                   
 11 Orangefs, originally called PVFS, was first de    
 12 Walt Ligon and Eric Blumer as a parallel file     
 13 Virtual Machine (PVM) as part of a NASA grant     
 14 of parallel programs.                             
 15                                                   
 16 Orangefs features include:                        
 17                                                   
 18   * Distributes file data among multiple file     
 19   * Supports simultaneous access by multiple c    
 20   * Stores file data and metadata on servers u    
 21     and access methods                            
 22   * Userspace implementation is easy to instal    
 23   * Direct MPI support                            
 24   * Stateless                                     
 25                                                   
 26                                                   
 27 Mailing List Archives                             
 28 =====================                             
 29                                                   
 30 http://lists.orangefs.org/pipermail/devel_list    
 31                                                   
 32                                                   
 33 Mailing List Submissions                          
 34 ========================                          
 35                                                   
 36 devel@lists.orangefs.org                          
 37                                                   
 38                                                   
 39 Documentation                                     
 40 =============                                     
 41                                                   
 42 http://www.orangefs.org/documentation/            
 43                                                   
 44 Running ORANGEFS On a Single Server               
 45 ===================================               
 46                                                   
 47 OrangeFS is usually run in large installations    
 48 clients, but a complete filesystem can be run     
 49 development and testing.                          
 50                                                   
 51 On Fedora, install orangefs and orangefs-serve    
 52                                                   
 53     dnf -y install orangefs orangefs-server       
 54                                                   
 55 There is an example server configuration file     
 56 /etc/orangefs/orangefs.conf.  Change localhost    
 57 necessary.                                        
 58                                                   
 59 To generate a filesystem to run xfstests again    
 60                                                   
 61 There is an example client configuration file     
 62 single line.  Uncomment it and change the host    
 63 controls clients which use libpvfs2.  This doe    
 64 pvfs2-client-core.                                
 65                                                   
 66 Create the filesystem::                           
 67                                                   
 68     pvfs2-server -f /etc/orangefs/orangefs.con    
 69                                                   
 70 Start the server::                                
 71                                                   
 72     systemctl start orangefs-server               
 73                                                   
 74 Test the server::                                 
 75                                                   
 76     pvfs2-ping -m /pvfsmnt                        
 77                                                   
 78 Start the client.  The module must be compiled    
 79 point::                                           
 80                                                   
 81     systemctl start orangefs-client               
 82                                                   
 83 Mount the filesystem::                            
 84                                                   
 85     mount -t pvfs2 tcp://localhost:3334/orange    
 86                                                   
 87 Userspace Filesystem Source                       
 88 ===========================                       
 89                                                   
 90 http://www.orangefs.org/download                  
 91                                                   
 92 Orangefs versions prior to 2.9.3 would not be     
 93 upstream version of the kernel client.            
 94                                                   
 95                                                   
 96 Building ORANGEFS on a Single Server              
 97 ====================================              
 98                                                   
 99 Where OrangeFS cannot be installed from distri    
100 built from source.                                
101                                                   
102 You can omit --prefix if you don't care that t    
103 in /usr/local.  As of version 2.9.6, OrangeFS     
104 default, we will probably be changing the defa    
105                                                   
106 ::                                                
107                                                   
108     ./configure --prefix=/opt/ofs --with-db-ba    
109                                                   
110     make                                          
111                                                   
112     make install                                  
113                                                   
114 Create an orangefs config file by running pvfs    
115 specifying a target config file. Pvfs2-genconf    
116 through. Generally it works fine to take the d    
117 should use your server's hostname, rather than    
118 it comes to that question::                       
119                                                   
120     /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.co    
121                                                   
122 Create an /etc/pvfs2tab file (localhost is fin    
123                                                   
124     echo tcp://localhost:3334/orangefs /pvfsmn    
125         /etc/pvfs2tab                             
126                                                   
127 Create the mount point you specified in the ta    
128                                                   
129     mkdir /pvfsmnt                                
130                                                   
131 Bootstrap the server::                            
132                                                   
133     /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.c    
134                                                   
135 Start the server::                                
136                                                   
137     /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf    
138                                                   
139 Now the server should be running. Pvfs2-ls is     
140 test to verify that the server is running::       
141                                                   
142     /opt/ofs/bin/pvfs2-ls /pvfsmnt                
143                                                   
144 If stuff seems to be working, load the kernel     
145 turn on the client core::                         
146                                                   
147     /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbi    
148                                                   
149 Mount your filesystem::                           
150                                                   
151     mount -t pvfs2 tcp://`hostname`:3334/orang    
152                                                   
153                                                   
154 Running xfstests                                  
155 ================                                  
156                                                   
157 It is useful to use a scratch filesystem with     
158 done with only one server.                        
159                                                   
160 Make a second copy of the FileSystem section i    
161 file, which is /etc/orangefs/orangefs.conf.  C    
162 Change the ID to something other than the ID o    
163 section (2 is usually a good choice).             
164                                                   
165 Then there are two FileSystem sections: orange    
166                                                   
167 This change should be made before creating the    
168                                                   
169 ::                                                
170                                                   
171     pvfs2-server -f /etc/orangefs/orangefs.con    
172                                                   
173 To run xfstests, create /etc/xfsqa.config::       
174                                                   
175     TEST_DIR=/orangefs                            
176     TEST_DEV=tcp://localhost:3334/orangefs        
177     SCRATCH_MNT=/scratch                          
178     SCRATCH_DEV=tcp://localhost:3334/scratch      
179                                                   
180 Then xfstests can be run::                        
181                                                   
182     ./check -pvfs2                                
183                                                   
184                                                   
185 Options                                           
186 =======                                           
187                                                   
188 The following mount options are accepted:         
189                                                   
190   acl                                             
191     Allow the use of Access Control Lists on f    
192                                                   
193   intr                                            
194     Some operations between the kernel client     
195     filesystem can be interruptible, such as c    
196     and the setting of tunable parameters.        
197                                                   
198   local_lock                                      
199     Enable posix locking from the perspective     
200     default file_operations lock action is to     
201     locking kicks in if the filesystem is moun    
202     Distributed locking is being worked on for    
203                                                   
204                                                   
205 Debugging                                         
206 =========                                         
207                                                   
208 If you want the debug (GOSSIP) statements in a    
209 source file (inode.c for example) go to syslog    
210                                                   
211   echo inode > /sys/kernel/debug/orangefs/kern    
212                                                   
213 No debugging (the default)::                      
214                                                   
215   echo none > /sys/kernel/debug/orangefs/kerne    
216                                                   
217 Debugging from several source files::             
218                                                   
219   echo inode,dir > /sys/kernel/debug/orangefs/    
220                                                   
221 All debugging::                                   
222                                                   
223   echo all > /sys/kernel/debug/orangefs/kernel    
224                                                   
225 Get a list of all debugging keywords::            
226                                                   
227   cat /sys/kernel/debug/orangefs/debug-help       
228                                                   
229                                                   
230 Protocol between Kernel Module and Userspace      
231 ============================================      
232                                                   
233 Orangefs is a user space filesystem and an ass    
234 We'll just refer to the user space part of Ora    
235 from here on out. Orangefs descends from PVFS,    
236 still uses PVFS for function and variable name    
237 many of the important structures. Function and    
238 the kernel module have been transitioned to "o    
239 Coding Style avoids typedefs, so kernel module    
240 correspond to userspace structures are not typ    
241                                                   
242 The kernel module implements a pseudo device t    
243 can read from and write to. Userspace can also    
244 kernel module through the pseudo device with i    
245                                                   
246 The Bufmap                                        
247 ----------                                        
248                                                   
249 At startup userspace allocates two page-size-a    
250 mlocked memory buffers, one is used for IO and    
251 operations. The IO buffer is 41943040 bytes an    
252 4194304 bytes. Each buffer contains logical ch    
253 a pointer to each buffer is added to its own P    
254 which also describes its total size, as well a    
255 the partitions.                                   
256                                                   
257 A pointer to the IO buffer's PVFS_dev_map_desc    
258 mapping routine in the kernel module with an i    
259 copied from user space to kernel space with co    
260 to initialize the kernel module's "bufmap" (st    
261 then contains:                                    
262                                                   
263   * refcnt                                        
264     - a reference counter                         
265   * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE    
266     partition size, which represents the files    
267     is used for s_blocksize in super blocks.      
268   * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COU    
269     partitions in the IO buffer.                  
270   * desc_shift - log2(desc_size), used for s_b    
271   * total_size - the total size of the IO buff    
272   * page_count - the number of 4096 byte pages    
273   * page_array - a pointer to ``page_count * (    
274     of kcalloced memory. This memory is used a    
275     to each of the pages in the IO buffer thro    
276   * desc_array - a pointer to ``desc_count * (    
277     bytes of kcalloced memory. This memory is     
278                                                   
279       user_desc is the kernel's copy of the IO    
280       structure. user_desc->ptr points to the     
281                                                   
282       ::                                          
283                                                   
284         pages_per_desc = bufmap->desc_size / P    
285         offset = 0                                
286                                                   
287         bufmap->desc_array[0].page_array = &bu    
288         bufmap->desc_array[0].array_count = pa    
289         bufmap->desc_array[0].uaddr = (user_de    
290         offset += 1024                            
291                            .                      
292                            .                      
293                            .                      
294         bufmap->desc_array[9].page_array = &bu    
295         bufmap->desc_array[9].array_count = pa    
296         bufmap->desc_array[9].uaddr = (user_de    
297                                                   
298         offset += 1024                            
299                                                   
300   * buffer_index_array - a desc_count sized ar    
301     indicate which of the IO buffer's partitio    
302   * buffer_index_lock - a spinlock to protect     
303   * readdir_index_array - a five (ORANGEFS_REA    
304     int array used to indicate which of the re    
305     available to use.                             
306   * readdir_index_lock - a spinlock to protect    
307     update.                                       
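
The size fields in the list above are all derivable from the IO buffer's
total size. The following is a minimal sketch in Python, not kernel code.
It assumes ten partitions (the example indexes desc_array[0] through
desc_array[9]) and the 4096 byte page size mentioned above; the function
names are hypothetical.

```python
# Toy model of the bufmap geometry; not kernel code.
IO_BUFFER_BYTES = 41943040   # IO buffer size stated in the text
PAGE_SIZE = 4096             # page size used for page_count in the text
DESC_COUNT = 10              # assumption: the example indexes desc_array[0]..[9]


def bufmap_geometry(total_size, desc_count, page_size):
    """Derive the size fields listed above from the buffer's total size."""
    if total_size % desc_count or total_size % page_size:
        raise ValueError("buffer must divide evenly into partitions and pages")
    desc_size = total_size // desc_count
    if desc_size & (desc_size - 1):
        raise ValueError("desc_size must be a power of two to have a shift")
    return {
        "desc_size": desc_size,
        "desc_count": desc_count,
        "desc_shift": desc_size.bit_length() - 1,  # log2(desc_size)
        "total_size": total_size,
        "page_count": total_size // page_size,
        "pages_per_desc": desc_size // page_size,
    }


def desc_page_offsets(geometry):
    """First page_array index of each partition (the running offset)."""
    step = geometry["pages_per_desc"]
    return [i * step for i in range(geometry["desc_count"])]


geometry = bufmap_geometry(IO_BUFFER_BYTES, DESC_COUNT, PAGE_SIZE)
offsets = desc_page_offsets(geometry)
```

Under these assumptions each partition spans 1024 pages, which matches the
``offset += 1024`` step in the initialization example.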
308                                                   
309 Operations                                        
310 ----------                                        
311                                                   
312 The kernel module builds an "op" (struct orang    
313 needs to communicate with userspace. Part of t    
314 which expresses the request to userspace. Part    
315 contains the "downcall" which expresses the re    
316                                                   
317 The slab allocator is used to keep a cache of     
318                                                   
319 At init time the kernel module defines and ini    
320 and an in_progress hash table to keep track of    
321 in flight at any given time.                      
322                                                   
323 Ops are stateful:                                 
324                                                   
325  * unknown                                        
326             - op was just initialized             
327  * waiting                                        
328             - op is on request_list (upward bo    
329  * inprogr                                        
330             - op is in progress (waiting for d    
331  * serviced                                       
332             - op has matching downcall; ok        
333  * purged                                         
334             - op has to start a timer since cl    
335               exited uncleanly before servicin    
336  * given up                                       
337             - submitter has given up waiting f    
338                                                   
339 When some arbitrary userspace program needs to    
340 filesystem operation on Orangefs (readdir, I/O    
341 an op structure is initialized and tagged with    
342 number. The upcall part of the op is filled ou    
343 passed to the "service_operation" function.       
344                                                   
345 Service_operation changes the op's state to "w    
346 it on the request list, and signals the Orange    
347 function through a wait queue. Userspace is po    
348 and thus becomes aware of the upcall request t    
349                                                   
350 When the Orangefs file_operations.read functio    
351 request list is searched for an op that seems     
352 The op is removed from the request list. The t    
353 the filled-out upcall struct are copy_to_user'    
354                                                   
355 If any of these (and some additional protocol)    
356 the op's state is set to "waiting" and the op     
357 the request list. Otherwise, the op's state is    
358 and the op is hashed on its tag and put onto t    
359 in_progress hash table at the index the tag ha    
360                                                   
361 When userspace has assembled the response to t    
362 writes the response, which includes the distin    
363 the pseudo device in a series of io_vecs. This    
364 file_operations.write_iter function to find th    
365 tag and remove it from the in_progress hash ta    
366 state is not "canceled" or "given up", its sta    
367 The file_operations.write_iter function return    
368 and back to service_operation through wait_for    
369                                                   
370 Service operation returns to its caller with t    
371 part (the response to the upcall) filled out.     
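
The life cycle described above can be modelled in a few lines. This is a
toy sketch in Python, not the kernel implementation: the class names, the
method names (submit, device_read, device_write) and the hash table size
are all hypothetical, and waiting, timeouts and purging are left out.

```python
from collections import deque
from itertools import count

HASH_BUCKETS = 16  # hypothetical size for the in_progress hash table


class Op:
    """One request: a tag, an upcall, and eventually a downcall."""

    def __init__(self, tag, upcall):
        self.tag = tag
        self.upcall = upcall
        self.downcall = None
        self.state = "unknown"  # op was just initialized


class OpQueue:
    def __init__(self):
        self._tags = count(1)           # source of distinguishing tags
        self.request_list = deque()     # ops waiting for userspace
        self.in_progress = [[] for _ in range(HASH_BUCKETS)]

    def submit(self, upcall):
        """Stand-in for service_operation: queue the op as waiting."""
        op = Op(next(self._tags), upcall)
        op.state = "waiting"
        self.request_list.append(op)
        return op

    def device_read(self):
        """Userspace reads the device: hand over the next waiting op."""
        if not self.request_list:
            return None
        op = self.request_list.popleft()
        op.state = "in progress"
        self.in_progress[op.tag % HASH_BUCKETS].append(op)
        return op.tag, op.upcall

    def device_write(self, tag, downcall):
        """Userspace writes a response: match it to its op by tag."""
        bucket = self.in_progress[tag % HASH_BUCKETS]
        for op in bucket:
            if op.tag == tag:
                bucket.remove(op)
                if op.state not in ("canceled", "given up"):
                    op.downcall = downcall
                    op.state = "serviced"
                return op
        return None  # no op in flight carries this tag


queue = OpQueue()
op = queue.submit({"type": "lookup"})
tag, upcall = queue.device_read()
queue.device_write(tag, {"status": 0})
```

After the last call the op is in the "serviced" state and carries its
downcall, which is the point at which the submitter would resume.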
372                                                   
373 The "client-core" is the bridge between the ke    
374 userspace. The client-core is a daemon. The cl    
375 associated watchdog daemon. If the client-core    
376 to die, the watchdog daemon restarts the clien    
377 the client-core is restarted "right away", the    
378 time during such an event that the client-core    
379 can't be triggered by the Orangefs file_operat    
380 Ops that pass through service_operation during    
381 on the wait queue and one attempt is made to r    
382 if the client-core stays dead too long, the ar    
383 trying to use Orangefs will be negatively affe    
384 that can't be serviced will be removed from th    
385 have their states set to "given up". In-progre    
386 be serviced will be removed from the in_progre    
387 have their states set to "given up".              
388                                                   
389 Readdir and I/O ops are atypical with respect     
390                                                   
391   - readdir ops use the smaller of the two pre    
392     memory buffers. The readdir buffer is only    
393     The kernel module obtains an index to a fr    
394     a readdir op. Userspace deposits the resul    
395     and then writes them back to the pvfs d    
396                                                   
397   - io (read and write) ops use the larger of     
398     pre-partitioned memory buffers. The IO buf    
399     both userspace and the kernel module. The     
400     index to a free partition before launching    
401     deposits write data into the indexed parti    
402     directly by userspace. Userspace deposits     
403     requests into the indexed partition, to be    
404     by the kernel module.                         
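
Obtaining an index to a free partition amounts to claiming a slot in an
array of in-use flags while holding a lock, as buffer_index_array and
buffer_index_lock do for the IO buffer. The sketch below is a simplified
Python model with hypothetical names. It returns None when every slot is
taken, which is a simplification made for the sketch.

```python
import threading


class SlotArray:
    """Tracks which partitions of a shared buffer are currently in use."""

    def __init__(self, slot_count):
        self._in_use = [0] * slot_count   # 0 = free, 1 = taken
        self._lock = threading.Lock()     # stands in for the spinlock

    def get(self):
        """Claim the lowest free slot; return its index, or None if full."""
        with self._lock:
            for index, used in enumerate(self._in_use):
                if not used:
                    self._in_use[index] = 1
                    return index
        return None

    def put(self, index):
        """Release a slot once the op that used it has completed."""
        with self._lock:
            self._in_use[index] = 0


slots = SlotArray(2)
first = slots.get()
second = slots.get()
exhausted = slots.get()
slots.put(first)
reused = slots.get()
```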
405                                                   
406 Responses to kernel requests are all packaged     
407 structs. Besides a few other members, pvfs2_do    
408 union of structs, each of which is associated     
409 response type.                                    
410                                                   
411 The several members outside of the union are:     
412                                                   
413  ``int32_t type``                                 
414     - type of operation.                          
415  ``int32_t status``                               
416     - return code for the operation.              
417  ``int64_t trailer_size``                         
418     - 0 unless readdir operation.                 
419  ``char *trailer_buf``                            
420     - initialized to NULL, used during readdir    
421                                                   
422 The appropriate member inside the union is fil    
423 particular response.                              
424                                                   
425   PVFS2_VFS_OP_FILE_IO                            
426     fill a pvfs2_io_response_t                    
427                                                   
428   PVFS2_VFS_OP_LOOKUP                             
429     fill a PVFS_object_kref                       
430                                                   
431   PVFS2_VFS_OP_CREATE                             
432     fill a PVFS_object_kref                       
433                                                   
434   PVFS2_VFS_OP_SYMLINK                            
435     fill a PVFS_object_kref                       
436                                                   
437   PVFS2_VFS_OP_GETATTR                            
438     fill in a PVFS_sys_attr_s (tons of stuff t    
439     fill in a string with the link target when    
440                                                   
441   PVFS2_VFS_OP_MKDIR                              
442     fill a PVFS_object_kref                       
443                                                   
444   PVFS2_VFS_OP_STATFS                             
445     fill a pvfs2_statfs_response_t with useles    
446     us to know, in a timely fashion, these sta    
447     distributed network filesystem.               
448                                                   
449   PVFS2_VFS_OP_FS_MOUNT                           
450     fill a pvfs2_fs_mount_response_t which is     
451     except its members are in a different orde    
452     with "id".                                    
453                                                   
454   PVFS2_VFS_OP_GETXATTR                           
455     fill a pvfs2_getxattr_response_t              
456                                                   
457   PVFS2_VFS_OP_LISTXATTR                          
458     fill a pvfs2_listxattr_response_t             
459                                                   
460   PVFS2_VFS_OP_PARAM                              
461     fill a pvfs2_param_response_t                 
462                                                   
463   PVFS2_VFS_OP_PERF_COUNT                         
464     fill a pvfs2_perf_count_response_t            
465                                                   
466   PVFS2_VFS_OP_FSKEY                              
467     fill a pvfs2_fs_key_response_t                
468                                                   
469   PVFS2_VFS_OP_READDIR                            
470     jam everything needed to represent a pvfs    
471     the readdir buffer descriptor specified in    
472                                                   
473 Userspace uses writev() on /dev/pvfs2-req to p    
474 made by the kernel side.                          
475                                                   
476 A buffer_list containing:                         
477                                                   
478   - a pointer to the prepared response to the     
479     kernel (struct pvfs2_downcall_t).             
480   - and also, in the case of a readdir request    
481     buffer containing descriptors for the obje    
482     directory.                                    
483                                                   
484 ... is sent to the function (PINT_dev_write_li    
485 the writev.                                       
486                                                   
487 PINT_dev_write_list has a local iovec array: s    
488                                                   
489 The first four elements of io_array are initia    
490 responses::                                       
491                                                   
492   io_array[0].iov_base = address of local vari    
493   io_array[0].iov_len = sizeof(int32_t)           
494                                                   
495   io_array[1].iov_base = address of global var    
496   io_array[1].iov_len = sizeof(int32_t)           
497                                                   
498   io_array[2].iov_base = address of parameter     
499   io_array[2].iov_len = sizeof(int64_t)           
500                                                   
501   io_array[3].iov_base = address of out_downca    
502                          of global variable vf    
503   io_array[3].iov_len = sizeof(pvfs2_downcall_    
504                                                   
505 Readdir responses initialize the fifth element    
506                                                   
507   io_array[4].iov_base = contents of member tr    
508                          from out_downcall mem    
509                          vfs_request              
510   io_array[4].iov_len = contents of member tra    
511                         from out_downcall memb    
512                         vfs_request               
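
The gather-write layout above can be tried out against an ordinary pipe.
The following Python sketch is illustrative only: the three leading
scalars have placeholder names (only their sizes are taken from the
listing above), the downcall and trailer are opaque byte strings of
arbitrary length, and nothing is written to the real pseudo device.

```python
import os
import struct


def build_iovecs(first_i32, second_i32, third_i64, downcall, trailer=None):
    """Build the buffer list: three scalars, the downcall, optional trailer."""
    iovecs = [
        struct.pack("=i", first_i32),   # io_array[0]: sizeof(int32_t)
        struct.pack("=i", second_i32),  # io_array[1]: sizeof(int32_t)
        struct.pack("=q", third_i64),   # io_array[2]: sizeof(int64_t)
        downcall,                       # io_array[3]: the downcall struct
    ]
    if trailer is not None:
        iovecs.append(trailer)          # io_array[4]: readdir responses only
    return iovecs


# A pipe stands in for the device so the sketch can run anywhere on Unix.
read_fd, write_fd = os.pipe()
try:
    iovecs = build_iovecs(1, 2, 3, b"D" * 32, trailer=b"T" * 8)
    written = os.writev(write_fd, iovecs)   # one call, five buffers
    received = os.read(read_fd, written)
finally:
    os.close(read_fd)
    os.close(write_fd)
```

The reader sees the buffers back to back as a single stream, which is why
the fixed-size scalars come first: they let the receiver locate the rest.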
513                                                   
514 Orangefs exploits the dcache in order to avoid    
515 requests to userspace. We keep object inode at    
516 orangefs_inode_getattr. Orangefs_inode_getattr    
517 help it decide whether or not to update an ino    
518 Orangefs keeps private data in an object's ino    
519 timeout value, getattr_time, which allows any     
520 orangefs_inode_getattr to know how long it has    
521 updated. When the object is not new (new == 0)    
522 set (bypass == 0) orangefs_inode_getattr retur    
523 if getattr_time has not timed out. Getattr_tim    
524 inode is updated.                                 
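
The decision described above reduces to a small predicate. This Python
sketch is a simplified model, not the kernel function: it treats
getattr_time as an expiry deadline, uses a plain integer clock, ignores
counter wrap, and the class name and timeout value are hypothetical.

```python
def needs_refresh(new, bypass, now, getattr_time):
    """Return True when attributes must be fetched from userspace again."""
    if new or bypass:
        return True                 # caller insists on fresh attributes
    return now >= getattr_time      # cached attributes have timed out


class CachedAttrs:
    """Caches attributes and refetches them only when needed."""

    def __init__(self, fetch, timeout):
        self._fetch = fetch         # callable that asks userspace
        self._timeout = timeout     # hypothetical validity period
        self.attrs = None
        self.getattr_time = 0

    def getattr(self, now, new=0, bypass=0):
        if self.attrs is None or needs_refresh(new, bypass, now,
                                               self.getattr_time):
            self.attrs = self._fetch()
            self.getattr_time = now + self._timeout  # reset on every update
        return self.attrs


fetches = []


def fake_fetch():
    """Stand-in for a round trip to userspace; records each call."""
    fetches.append(1)
    return {"size": 0}


cache = CachedAttrs(fake_fetch, timeout=10)
cache.getattr(now=0, new=1)      # new object: always fetched
cache.getattr(now=5)             # still fresh: served from the cache
after_cached = len(fetches)
cache.getattr(now=5, bypass=1)   # bypass set: fetched again
cache.getattr(now=15)            # deadline reached: fetched again
```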
525                                                   
526 Creation of a new object (file, dir, sym-link)    
527 its pathname, resulting in a negative director    
528 A new inode is allocated and associated with t    
529 a negative dentry into a "productive full memb    
530 obtains the new inode from Linux with new_inod    
531 the inode with the dentry by sending the pair     
532 d_instantiate().                                  
533                                                   
534 The evaluation of a pathname for an object res    
535 dentry. If there is no corresponding dentry, o    
536 the dcache. Whenever a dentry is modified or v    
537 short timeout value in the dentry's d_time, an    
538 for that amount of time. Orangefs is a network    
539 can potentially change out-of-band with any pa    
540 instance, so trusting a dentry is risky. The a    
541 dentries is to always obtain the needed inform    
542 least a trip to the client-core, maybe to the     
543 from a dentry is cheap, obtaining it from user    
544 hence the motivation to use the dentry when po    
545                                                   
546 The timeout values d_time and getattr_time are    
547 code is designed to avoid the jiffy-wrap probl    
548                                                   
549     "In general, if the clock may have wrapped    
550     is no way to tell how much time has elapse    
551     and t2 are known to be fairly close, we ca    
552     difference in a way that takes into accoun    
553     clock may have wrapped between times."        
554                                                   
555 from course notes by instructor Andy Wang         
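
The quoted technique can be demonstrated with a wrapping counter. The
Python sketch below models a 32 bit counter and compares two readings by
the sign of their wrapped difference, in the spirit of the kernel's
time_after() helper. The counter width and function names are
assumptions made for the illustration.

```python
COUNTER_BITS = 32                      # assumed width of the counter
MASK = (1 << COUNTER_BITS) - 1
SIGN_BIT = 1 << (COUNTER_BITS - 1)


def signed_diff(a, b):
    """Return a - b as a signed value after wrapping to the counter width."""
    diff = (a - b) & MASK
    return diff - (1 << COUNTER_BITS) if diff & SIGN_BIT else diff


def time_after(a, b):
    """True if reading a is later than reading b, even across a wrap."""
    return signed_diff(a, b) > 0


before_wrap = 0xFFFFFFF0   # taken shortly before the counter wraps
after_wrap = 5             # taken shortly after it wraps
elapsed = signed_diff(after_wrap, before_wrap)
```

A naive ``after_wrap > before_wrap`` comparison gets this case wrong,
while the signed difference still reports the small positive interval.
The result is only meaningful when the two readings are less than half
the counter range apart.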
556                                                   
                                                      
