.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parall
for large storage problems faced by HPC, BigDa
Genomics, Bioinformatics.

Orangefs, originally called PVFS, was first de
Walt Ligon and Eric Blumer as a parallel file
Virtual Machine (PVM) as part of a NASA grant
of parallel programs.

Orangefs features include:

  * Distributes file data among multiple file
  * Supports simultaneous access by multiple c
  * Stores file data and metadata on servers u
    and access methods
  * Userspace implementation is easy to instal
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_list


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations
clients, but a complete filesystem can be run
development and testing.

On Fedora, install orangefs and orangefs-serve

    dnf -y install orangefs orangefs-server

There is an example server configuration file
/etc/orangefs/orangefs.conf. Change localhost
necessary.

To generate a filesystem to run xfstests again

There is an example client configuration file
single line. Uncomment it and change the host
controls clients which use libpvfs2. This doe
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.con

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client.
The module must be compiled
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orange

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

Orangefs versions prior to 2.9.3 would not be
upstream version of the kernel client.


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distri
built from source.

You can omit --prefix if you don't care that t
in /usr/local. As of version 2.9.6, OrangeFS
default, we will probably be changing the defa

::

    ./configure --prefix=/opt/ofs --with-db-ba

    make

    make install

Create an orangefs config file by running pvfs
specifying a target config file. Pvfs2-genconf
through. Generally it works fine to take the d
should use your server's hostname, rather than
it comes to that question::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.co

Create an /etc/pvfs2tab file (localhost is fin

    echo tcp://localhost:3334/orangefs /pvfsmn
    /etc/pvfs2tab

Create the mount point you specified in the ta

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.c

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. Pvfs2-ls is
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If stuff seems to be working, load the kernel
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbi

Mount your filesystem::

    mount -t pvfs2 tcp://`hostname`:3334/orang


Running xfstests
================

It is useful to use a scratch filesystem with
done with only one server.

Make a second copy of the FileSystem section i
file, which is /etc/orangefs/orangefs.conf. C
Change the ID to something other than the ID o
section (2 is usually a good choice).

Then there are two FileSystem sections: orange

This change should be made before creating the

::

    pvfs2-server -f /etc/orangefs/orangefs.con

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on f

  intr
    Some operations between the kernel client
    filesystem can be interruptible, such as c
    and the setting of tunable parameters.

  local_lock
    Enable posix locking from the perspective
    default file_operations lock action is to
    locking kicks in if the filesystem is moun
    Distributed locking is being worked on for


Debugging
=========

If you want the debug (GOSSIP) statements in a
source file (inode.c for example) go to syslog

  echo inode > /sys/kernel/debug/orangefs/kern

No debugging (the default)::

  echo none > /sys/kernel/debug/orangefs/kerne

Debugging from several source files::

  echo inode,dir > /sys/kernel/debug/orangefs/

All debugging::

  echo all > /sys/kernel/debug/orangefs/kernel

Get a list of all debugging keywords::

  cat /sys/kernel/debug/orangefs/debug-help


Protocol between Kernel Module and Userspace
============================================

Orangefs is a user space filesystem and an ass
We'll just refer to the user space part of Ora
from here on out.
Orangefs descends from PVFS,
still uses PVFS for function and variable name
many of the important structures. Function and
the kernel module have been transitioned to "o
Coding Style avoids typedefs, so kernel module
correspond to userspace structures are not typ

The kernel module implements a pseudo device t
can read from and write to. Userspace can also
kernel module through the pseudo device with i

The Bufmap
----------

At startup userspace allocates two page-size-a
mlocked memory buffers, one is used for IO and
operations. The IO buffer is 41943040 bytes an
4194304 bytes. Each buffer contains logical ch
a pointer to each buffer is added to its own P
which also describes its total size, as well a
the partitions.

A pointer to the IO buffer's PVFS_dev_map_desc
mapping routine in the kernel module with an i
copied from user space to kernel space with co
to initialize the kernel module's "bufmap" (st
then contains:

  * refcnt
    - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE
    partition size, which represents the files
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COU
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_b
  * total_size - the total size of the IO buff
  * page_count - the number of 4096 byte pages
  * page_array - a pointer to ``page_count * (
    of kcalloced memory. This memory is used a
    to each of the pages in the IO buffer thro
  * desc_array - a pointer to ``desc_count * (
    bytes of kcalloced memory. This memory is

    user_desc is the kernel's copy of the IO
    structure.
    user_desc->ptr points to the

    ::

      pages_per_desc = bufmap->desc_size / P
      offset = 0

      bufmap->desc_array[0].page_array = &bu
      bufmap->desc_array[0].array_count = pa
      bufmap->desc_array[0].uaddr = (user_de
      offset += 1024
                         .
                         .
                         .
      bufmap->desc_array[9].page_array = &bu
      bufmap->desc_array[9].array_count = pa
      bufmap->desc_array[9].uaddr = (user_de

      offset += 1024

  * buffer_index_array - a desc_count sized ar
    indicate which of the IO buffer's partitio
  * buffer_index_lock - a spinlock to protect
  * readdir_index_array - a five (ORANGEFS_REA
    int array used to indicate which of the re
    available to use.
  * readdir_index_lock - a spinlock to protect
    update.

Operations
----------

The kernel module builds an "op" (struct orang
needs to communicate with userspace. Part of t
which expresses the request to userspace. Part
contains the "downcall" which expresses the re

The slab allocator is used to keep a cache of

At init time the kernel module defines and ini
and an in_progress hash table to keep track of
in flight at any given time.

Ops are stateful:

 * unknown
      - op was just initialized
 * waiting
      - op is on request_list (upward bo
 * inprogr
      - op is in progress (waiting for d
 * serviced
      - op has matching downcall; ok
 * purged
      - op has to start a timer since cl
        exited uncleanly before servicin
 * given up
      - submitter has given up waiting f

When some arbitrary userspace program needs to
filesystem operation on Orangefs (readdir, I/O
an op structure is initialized and tagged with
number. The upcall part of the op is filled ou
passed to the "service_operation" function.

Service_operation changes the op's state to "w
it on the request list, and signals the Orange
function through a wait queue.
Userspace is po
and thus becomes aware of the upcall request t

When the Orangefs file_operations.read functio
request list is searched for an op that seems
The op is removed from the request list. The t
the filled-out upcall struct are copy_to_user'

If any of these (and some additional protocol)
the op's state is set to "waiting" and the op
the request list. Otherwise, the op's state is
and the op is hashed on its tag and put onto t
in_progress hash table at the index the tag ha

When userspace has assembled the response to t
writes the response, which includes the distin
the pseudo device in a series of io_vecs. This
file_operations.write_iter function to find th
tag and remove it from the in_progress hash ta
state is not "canceled" or "given up", its sta
The file_operations.write_iter function return
and back to service_operation through wait_for

Service_operation returns to its caller with t
part (the response to the upcall) filled out.

The "client-core" is the bridge between the ke
userspace. The client-core is a daemon. The cl
associated watchdog daemon. If the client-core
to die, the watchdog daemon restarts the clien
the client-core is restarted "right away", the
time during such an event that the client-core
can't be triggered by the Orangefs file_operat
Ops that pass through service_operation during
on the wait queue and one attempt is made to r
if the client-core stays dead too long, the ar
trying to use Orangefs will be negatively affe
that can't be serviced will be removed from th
have their states set to "given up". In-progre
be serviced will be removed from the in_progre
have their states set to "given up".

Readdir and I/O ops are atypical with respect

- readdir ops use the smaller of the two pre
  memory buffers.
  The readdir buffer is only
  The kernel module obtains an index to a fr
  a readdir op. Userspace deposits the resul
  and then writes them back to the pvfs d

- io (read and write) ops use the larger of
  pre-partitioned memory buffers. The IO buf
  both userspace and the kernel module. The
  index to a free partition before launching
  deposits write data into the indexed parti
  directly by userspace. Userspace deposits
  requests into the indexed partition, to be
  by the kernel module.

Responses to kernel requests are all packaged
structs. Besides a few other members, pvfs2_do
union of structs, each of which is associated
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir

The appropriate member inside the union is fil
particular response.

 PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

 PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

 PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

 PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

 PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff t
    fill in a string with the link target when

 PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

 PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useles
    us to know, in a timely fashion, these sta
    distributed network filesystem.

 PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is
    except its members are in a different orde
    with "id".

 PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

 PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

 PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

 PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

 PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

 PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs
    the readdir buffer descriptor specified in

Userspace uses writev() on /dev/pvfs2-req to p
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request
    buffer containing descriptors for the obje
    directory.

... is sent to the function (PINT_dev_write_li
the writev.

PINT_dev_write_list has a local iovec array: s

The first four elements of io_array are initia
responses::

  io_array[0].iov_base = address of local vari
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global var
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downca
                         of global variable vf
  io_array[3].iov_len = sizeof(pvfs2_downcall_

Readdir responses initialize the fifth element

  io_array[4].iov_base = contents of member tr
                         from out_downcall mem
                         vfs_request
  io_array[4].iov_len = contents of member tra
                        from out_downcall memb
                        vfs_request

Orangefs exploits the dcache in order to avoid
requests to userspace. We keep object inode at
orangefs_inode_getattr. Orangefs_inode_getattr
help it decide whether or not to update an ino
Orangefs keeps private data in an object's ino
timeout value, getattr_time, which allows any
orangefs_inode_getattr to know how long it has
updated.
When the object is not new (new == 0)
set (bypass == 0) orangefs_inode_getattr retur
if getattr_time has not timed out. Getattr_tim
inode is updated.

Creation of a new object (file, dir, sym-link)
its pathname, resulting in a negative director
A new inode is allocated and associated with t
a negative dentry into a "productive full memb
obtains the new inode from Linux with new_inod
the inode with the dentry by sending the pair
d_instantiate().

The evaluation of a pathname for an object res
dentry. If there is no corresponding dentry, o
the dcache. Whenever a dentry is modified or v
short timeout value in the dentry's d_time, an
for that amount of time. Orangefs is a network
can potentially change out-of-band with any pa
instance, so trusting a dentry is risky. The a
dentries is to always obtain the needed inform
least a trip to the client-core, maybe to the
from a dentry is cheap, obtaining it from user
hence the motivation to use the dentry when po

The timeout values d_time and getattr_time are
code is designed to avoid the jiffy-wrap probl

    "In general, if the clock may have wrapped
    is no way to tell how much time has elapse
    and t2 are known to be fairly close, we ca
    difference in a way that takes into accoun
    clock may have wrapped between times."

from course notes by instructor Andy Wang
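
The wrap-safe comparison described in the quote can be sketched in plain C.
This is only an illustration and not the module's actual code: the helper
names ``ticks_after`` and ``timeout_still_valid`` are hypothetical, and a
32-bit tick counter is assumed for the example. Inside the kernel, the
``time_after()`` and ``time_before()`` macros from ``<linux/jiffies.h>``
provide this kind of comparison for jiffies values.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the OrangeFS sources): returns true if
 * tick count a is later than tick count b, even if the counter wrapped
 * in between. Valid only while the two values are less than 2^31 ticks
 * apart, which is the "known to be fairly close" condition in the quote.
 */
static bool ticks_after(uint32_t a, uint32_t b)
{
	/*
	 * Unsigned subtraction wraps modulo 2^32. Reinterpreting the
	 * difference as signed yields a negative value exactly when a
	 * is ahead of b by fewer than 2^31 ticks.
	 */
	return (int32_t)(b - a) < 0;
}

/*
 * Hypothetical helper: a cached value stamped with the deadline
 * "expires" may still be trusted while "now" has not reached it.
 */
static bool timeout_still_valid(uint32_t now, uint32_t expires)
{
	return ticks_after(expires, now);
}
```

For example, with ``now = 0xFFFFFFF0`` and ``expires = 0x10`` the counter
wraps before the deadline is reached. ``timeout_still_valid()`` reports the
timeout as still pending, whereas a plain ``now < expires`` comparison would
wrongly treat it as expired.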