=========
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-sized address regions but there is no simple pattern
that would allow for a compact representation of the mapping such as
dm-stripe.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members") each having
independent controllers, disk storage and network adapters.  When a LUN
is created it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.
When iSCSI sessions are created, each
session is connected to an eth port on a single member.  Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required.  This
forwarding is invisible to the initiator.  The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.
However, in this architecture the LUN is
spread with an address region size on the order of tens of MBs, which
means the resulting table could have more than a million entries and
consume far too much memory.

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members for
failover reasons.

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O.  By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us).
This is a much denser representation than the dm table
b-tree can achieve.

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
        <num_paths>
            The number of paths across which to distribute the I/O.

        <region_size>
            The number of 512-byte sectors in a region. Each region can be redirected
            to any of the available paths.

        <num_optional_args>
            The number of optional arguments. Currently, no optional arguments
            are supported and so this must be zero.

        <dev_path>
            The block device that represents a specific path to the device.

        <offset>
            The offset of the start of data on the specific <dev_path> (in units
            of 512-byte sectors). This number is added to the sector number when
            forwarding the request to the specific path. Typically it is zero.

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size.

Create a switch device with 64kB region size::

    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
        switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::

    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set repetitive mapping.
This command::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10

is equivalent to::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
        :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
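
The message syntax described above can be checked offline before it is sent
with dmsetup. The following is a minimal Python sketch, not the kernel's
parser: the function names ``region_of_sector`` and ``expand_region_mappings``
are hypothetical, the sketch assumes the first mapping carries an explicit
index, it only repeats mappings given earlier in the same message, and it
assumes sectors are counted from the start of the switch device.

```python
def region_of_sector(sector, region_size):
    """Return the region number that contains a 512-byte sector.

    Assumption: sectors are counted from the start of the switch device.
    """
    return sector // region_size


def expand_region_mappings(args, num_paths):
    """Expand set_region_mappings arguments into (region, path_nr) pairs.

    All numbers in args are hexadecimal WITHOUT a 0x prefix, as the
    message format requires.  This is an illustrative sketch only.
    """
    table = {}   # region -> path_nr for mappings seen in this message
    order = []   # (region, path_nr) pairs in the order they are applied
    last = None  # last region written, None before the first mapping

    def write(region, path_nr):
        nonlocal last
        if not 0 <= path_nr < num_paths:
            raise ValueError(f"path number {path_nr:x} out of range")
        table[region] = path_nr
        order.append((region, path_nr))
        last = region

    for arg in args:
        if arg.startswith("R"):
            # R<n>,<m>: repeat the last n mappings in the next m slots.
            n_str, m_str = arg[1:].split(",")
            n, m = int(n_str, 16), int(m_str, 16)
            if last is None or n < 1:
                raise ValueError("nothing to repeat")
            for _ in range(m):
                src = last + 1 - n
                if src not in table:
                    # Sketch limitation: source must be in this message.
                    raise ValueError("repeat source not in this message")
                write(last + 1, table[src])
            continue

        index_str, path_str = arg.split(":")
        if index_str:
            region = int(index_str, 16)
        elif last is not None:
            region = last + 1  # omitted index: previous index + 1
        else:
            raise ValueError("first mapping needs an explicit index here")
        write(region, int(path_str, 16))

    return order


if __name__ == "__main__":
    # With a region size of 128 sectors (64kB), sector 129 is in region 1.
    print(region_of_sector(129, 128))
    # The repetitive form and the long form from the example above
    # expand to the same list of mappings.
    short = expand_region_mappings(["1000:1", ":2", "R2,10"], 3)
    long_form = expand_region_mappings(["1000:1", ":2"] + [":1", ":2"] * 8, 3)
    print(short == long_form)
```

Expanding a message this way makes it easy to confirm how many regions a
``R<n>,<m>`` term touches (here 0x10 = 16 slots after the two explicit
mappings) before modifying a live region table.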