/* SPDX-License-Identifier: Apache-2.0 OR BSD-2-Clause */
//
// VAES and VPCLMULQDQ optimized AES-GCM for x86_64
//
// Copyright 2024 Google LLC
//
// Author: Eric Biggers <ebiggers@google.com>
//
//------------------------------------------------------------------------------
//
// This file is dual-licensed, meaning that you can use it under your choice of
// either of the following two licenses:
//
// Licensed under the Apache License 2.0 (the "License").  You may obtain a copy
// of the License at
//
//	http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// or
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice,
//    this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
//    notice, this list of conditions and the following disclaimer in the
//    documentation and/or other materials provided with the distribution.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.
//
//------------------------------------------------------------------------------
//
// This file implements AES-GCM (Galois/Counter Mode) for x86_64 CPUs that
// support VAES (vector AES), VPCLMULQDQ (vector carryless multiplication), and
// either AVX512 or AVX10.  Some of the functions, notably the encryption and
// decryption update functions which are the most performance-critical, are
// provided in two variants generated from a macro: one using 256-bit vectors
// (suffix: vaes_avx10_256) and one using 512-bit vectors (vaes_avx10_512).  The
// other, "shared" functions (vaes_avx10) use 256-bit vectors.
//
// The functions that use 512-bit vectors are intended for CPUs that support
// 512-bit vectors *and* where using them doesn't cause significant
// downclocking.  They require the following CPU features:
//
//	VAES && VPCLMULQDQ && BMI2 && ((AVX512BW && AVX512VL) || AVX10/512)
//
// The other functions require the following CPU features:
//
//	VAES && VPCLMULQDQ && BMI2 && ((AVX512BW && AVX512VL) || AVX10/256)
//
// All functions use the "System V" ABI.  The Windows ABI is not supported.
//
// Note that we use "avx10" in the names of the functions as a shorthand to
// really mean "AVX10 or a certain set of AVX512 features".  Due to Intel's
// introduction of AVX512 and then its replacement by AVX10, there doesn't seem
// to be a simple way to name things that makes sense on all CPUs.
//
// Note that the macros that support both 256-bit and 512-bit vectors could
// fairly easily be changed to support 128-bit too.  However, this would *not*
// be sufficient to allow the code to run on CPUs without AVX512 or AVX10,
// because the code heavily uses several features of these extensions other than
// the vector length: the increase in the number of SIMD registers from 16 to
// 32, masking support, and new instructions such as vpternlogd (which can do a
// three-argument XOR).  These features are very useful for AES-GCM.
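//
// As a rough illustration of the CPU feature requirements listed above, a C
// caller could gate the 256-bit functions roughly as follows.  This is only a
// hedged sketch, not the actual glue code: it checks only the AVX512BW/AVX512VL
// path, and it omits the AVX10 detection and the XSAVE state checks that a real
// kernel caller would also need.
//
//	static bool have_vaes_avx10_256(void)
//	{
//		return boot_cpu_has(X86_FEATURE_VAES) &&
//		       boot_cpu_has(X86_FEATURE_VPCLMULQDQ) &&
//		       boot_cpu_has(X86_FEATURE_BMI2) &&
//		       boot_cpu_has(X86_FEATURE_AVX512BW) &&
//		       boot_cpu_has(X86_FEATURE_AVX512VL);
//	}
//
// Whether to additionally use the 512-bit functions is a policy decision about
// downclocking on the particular CPU, not just a feature-bit check.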

#include <linux/linkage.h>

	.section .rodata
	.p2align 6

	// A shuffle mask that reflects the bytes of 16-byte blocks.
.Lbswap_mask:
	.octa	0x000102030405060708090a0b0c0d0e0f

	// This is the GHASH reducing polynomial without its constant term, i.e.
	// x^128 + x^7 + x^2 + x, represented using the backwards mapping
	// between bits and polynomial coefficients.
	//
	// Alternatively, it can be interpreted as the naturally-ordered
	// representation of the polynomial x^127 + x^126 + x^121 + 1, i.e. the
	// "reversed" GHASH reducing polynomial without its x^128 term.
.Lgfpoly:
	.octa	0xc2000000000000000000000000000001

	// Same as above, but with the (1 << 64) bit set.
.Lgfpoly_and_internal_carrybit:
	.octa	0xc2000000000000010000000000000001

	// The below constants are used for incrementing the counter blocks.
	// ctr_pattern points to the four 128-bit values [0, 1, 2, 3].
	// inc_2blocks and inc_4blocks point to the single 128-bit values 2 and
	// 4.  Note that the same '2' is reused in ctr_pattern and inc_2blocks.
.Lctr_pattern:
	.octa	0
	.octa	1
.Linc_2blocks:
	.octa	2
	.octa	3
.Linc_4blocks:
	.octa	4

// Number of powers of the hash key stored in the key struct.  The powers are
// stored from highest (H^NUM_H_POWERS) to lowest (H^1).
#define NUM_H_POWERS		16

// Offset to AES key length (in bytes) in the key struct
#define OFFSETOF_AESKEYLEN	480

// Offset to start of hash key powers array in the key struct
#define OFFSETOF_H_POWERS	512

// Offset to end of hash key powers array in the key struct.
//
// This is immediately followed by three zeroized padding blocks, which are
// included so that partial vectors can be handled more easily.  E.g. if VL=64
// and two blocks remain, we load the 4 values [H^2, H^1, 0, 0].  The most
// padding blocks needed is 3, which occurs if [H^1, 0, 0, 0] is loaded.
#define OFFSETOFEND_H_POWERS	(OFFSETOF_H_POWERS + (NUM_H_POWERS * 16))

.text

// Set the vector length in bytes.  This sets the VL variable and defines the
// register aliases V0-V31 that map to the ymm or zmm registers.
.macro	_set_veclen	vl
	.set	VL,	\vl
.irp i, 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15, \
	16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
.if VL == 32
	.set	V\i,	%ymm\i
.elseif VL == 64
	.set	V\i,	%zmm\i
.else
	.error "Unsupported vector length"
.endif
.endr
.endm

// The _ghash_mul_step macro does one step of GHASH-multiplying the 128-bit
// lanes of \a by the corresponding 128-bit lanes of \b and storing the
// reduced products in \dst.  \t0, \t1, and \t2 are temporary registers of the
// same size as \a and \b.  To complete all steps, this must be invoked with \i=0
// through \i=9.  The division into steps allows users of this macro to
// optionally interleave the computation with other instructions.  Users of this
// macro must preserve the parameter registers across steps.
//
// The multiplications are done in GHASH's representation of the finite field
// GF(2^128).  Elements of GF(2^128) are represented as binary polynomials
// (i.e. polynomials whose coefficients are bits) modulo a reducing polynomial
// G.  The GCM specification uses G = x^128 + x^7 + x^2 + x + 1.  Addition is
// just XOR, while multiplication is more complex and has two parts: (a) do
// carryless multiplication of two 128-bit input polynomials to get a 256-bit
// intermediate product polynomial, and (b) reduce the intermediate product to
// 128 bits by adding multiples of G that cancel out its low-order terms.  (Adding
// multiples of G doesn't change which field element the polynomial represents.)
//
// Unfortunately, the GCM specification maps bits to/from polynomial
// coefficients backwards from the natural order.  In each byte, it considers the
// highest bit to be the lowest order polynomial coefficient, and vice versa.
// This makes it nontrivial to work with the GHASH polynomials.  We could
// reflect the bits, but x86 doesn't have an instruction that does that.
//
// Instead, we operate on the values without bit-reflecting them.  This *mostly*
// just works, since XOR and carryless multiplication are symmetric with respect
// to bit order, but it has some consequences.  First, due to GHASH's byte
// order, by skipping bit reflection, *byte* reflection becomes necessary to
// give the polynomial terms a consistent order.  E.g., considering an N-bit
// value interpreted using the G = x^128 + x^7 + x^2 + x + 1 convention, bits 0
// through N-1 of the byte-reflected value represent the coefficients of x^(N-1)
// through x^0, whereas bits 0 through N-1 of the non-byte-reflected value
// represent x^7...x^0, x^15...x^8, ..., x^(N-8)...x^(N-1), which is not an order
// we can work with.  Fortunately, x86's vpshufb instruction can do the needed
// byte reflections cheaply.
//
// Second, forgoing the bit reflection causes an extra factor of x (still
// using the G = x^128 + x^7 + x^2 + x + 1 convention) to be introduced by each
// multiplication.  This is because an M-bit by N-bit carryless multiplication
// really produces a (M+N-1)-bit product, but in practice it's zero-extended to
// M+N bits.  In the G = x^128 + x^7 + x^2 + x + 1 convention, which maps bits
// to polynomial coefficients backwards, this zero-extension actually changes
// the product by introducing an extra factor of x.  Therefore, users of this
// macro must ensure that one of the inputs has an extra factor of x^-1, i.e.
// the multiplicative inverse of x, to cancel out the extra factor of x.
//
// Third, the backwards coefficients convention is confusing to work with,
// since it makes "low" and "high" in the polynomial sense the opposite of
// their normal meaning in computer programming.  This can be solved by using an
// alternative interpretation: the polynomial coefficients are understood to be
// in the natural order, and the multiplication is actually \a * \b * x^-128 mod
// x^128 + x^127 + x^126 + x^121 + 1.  This doesn't change the inputs, outputs,
// or the implementation at all; it just changes the mathematical interpretation
// of what each instruction is doing.  Starting from here, we'll use this
// alternative interpretation, as it's easier to understand the code that way.
//
// Moving onto the implementation, the vpclmulqdq instruction does 64 x 64 =>
// 128-bit carryless multiplication, so we break the 128 x 128 multiplication
// into parts as follows (the _L and _H suffixes denote low and high 64 bits):
//
//	LO = a_L * b_L
//	MI = (a_L * b_H) + (a_H * b_L)
//	HI = a_H * b_H
//
// The 256-bit product is x^128*HI + x^64*MI + LO.  LO, MI, and HI are 128-bit.
// Note that MI "overlaps" with LO and HI.  We don't consolidate MI into LO and
// HI right away, since the way the reduction works makes that unnecessary.
//
// For the reduction, we cancel out the low 128 bits by adding multiples of G =
// x^128 + x^127 + x^126 + x^121 + 1.  This is done by two iterations, each of
// which cancels out the next lowest 64 bits.  Consider a value x^64*A + B,
// where A and B are 128-bit.  Adding B_L*G to that value gives:
//
//	   x^64*A + B + B_L*G
//	 = x^64*A + x^64*B_H + B_L + B_L*(x^128 + x^127 + x^126 + x^121 + 1)
//	 = x^64*A + x^64*B_H + B_L + x^128*B_L + x^64*B_L*(x^63 + x^62 + x^57) + B_L
//	 = x^64*A + x^64*B_H + x^128*B_L + x^64*B_L*(x^63 + x^62 + x^57)
//	 = x^64*(A + B_H + x^64*B_L + B_L*(x^63 + x^62 + x^57))
//
// So: if we sum A, B with its halves swapped, and the low half of B times x^63
// + x^62 + x^57, we get a 128-bit value C where x^64*C is congruent to the
// original value x^64*A + B.  I.e., the low 64 bits got canceled out.
//
// We just need to apply this twice: first to fold LO into MI, and second to
// fold the updated MI into HI.
//
// The needed three-argument XORs are done using the vpternlogd instruction with
// immediate 0x96, since this is faster than two vpxord instructions.
//
// A potential optimization, assuming that b is fixed per-key (if a is fixed
// per-key it would work the other way around), is to use one iteration of the
// reduction described above to precompute a value c such that x^64*c = b mod G,
// and then multiply a_L by c (and implicitly by x^64) instead of by b:
//
//	MI = (a_L * c_L) + (a_H * b_L)
//	HI = (a_L * c_H) + (a_H * b_H)
//
// This would eliminate the LO part of the intermediate product, which would
// eliminate the need to fold LO into MI.  This would save two instructions,
// including a vpclmulqdq.  However, we currently don't use this optimization
// because it would require twice as many per-key precomputed values.
//
// Using Karatsuba multiplication instead of "schoolbook" multiplication
// similarly would save a vpclmulqdq but does not seem to be worth it.
.macro	_ghash_mul_step	i, a, b, dst, gfpoly, t0, t1, t2
.if \i == 0
	vpclmulqdq	$0x00, \a, \b, \t0	  // LO = a_L * b_L
	vpclmulqdq	$0x01, \a, \b, \t1	  // MI_0 = a_L * b_H
.elseif \i == 1
	vpclmulqdq	$0x10, \a, \b, \t2	  // MI_1 = a_H * b_L
.elseif \i == 2
	vpxord		\t2, \t1, \t1		  // MI = MI_0 + MI_1
.elseif \i == 3
	vpclmulqdq	$0x01, \t0, \gfpoly, \t2  // LO_L*(x^63 + x^62 + x^57)
.elseif \i == 4
	vpshufd		$0x4e, \t0, \t0		  // Swap halves of LO
.elseif \i == 5
	vpternlogd	$0x96, \t2, \t0, \t1	  // Fold LO into MI
.elseif \i == 6
	vpclmulqdq	$0x11, \a, \b, \dst	  // HI = a_H * b_H
.elseif \i == 7
	vpclmulqdq	$0x01, \t1, \gfpoly, \t0  // MI_L*(x^63 + x^62 + x^57)
.elseif \i == 8
	vpshufd		$0x4e, \t1, \t1		  // Swap halves of MI
.elseif \i == 9
	vpternlogd	$0x96, \t0, \t1, \dst	  // Fold MI into HI
.endif
.endm
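
// For reference, the multiplication and reduction done by the above macro can
// be modeled in plain C roughly as follows.  This is an illustrative sketch
// only (the helper names are made up, and a bit-by-bit carryless multiply is
// used in place of vpclmulqdq); it operates on the same byte-reflected
// representation as the assembly, i.e. it computes a * b * x^-128 mod
// x^128 + x^127 + x^126 + x^121 + 1 using the "natural" interpretation
// described above:
//
//	struct u128 { u64 lo, hi; };
//
//	static struct u128 clmul_64x64(u64 a, u64 b)
//	{
//		struct u128 p = {};
//
//		for (int i = 0; i < 64; i++) {
//			if ((b >> i) & 1) {
//				p.lo ^= a << i;
//				if (i)
//					p.hi ^= a >> (64 - i);
//			}
//		}
//		return p;
//	}
//
//	static struct u128 ghash_mul(struct u128 a, struct u128 b)
//	{
//		const u64 poly = 0xc200000000000000ULL; // x^63 + x^62 + x^57
//		struct u128 lo, mi, hi, t;
//
//		// Schoolbook multiplication: x^128*HI + x^64*MI + LO
//		lo = clmul_64x64(a.lo, b.lo);
//		mi = clmul_64x64(a.lo, b.hi);
//		t  = clmul_64x64(a.hi, b.lo);
//		mi.lo ^= t.lo; mi.hi ^= t.hi;
//		hi = clmul_64x64(a.hi, b.hi);
//
//		// Fold LO into MI:  MI += swap(LO) + LO_L*(x^63 + x^62 + x^57)
//		t = clmul_64x64(lo.lo, poly);
//		mi.lo ^= lo.hi ^ t.lo;
//		mi.hi ^= lo.lo ^ t.hi;
//
//		// Fold MI into HI the same way; HI is then the reduced product.
//		t = clmul_64x64(mi.lo, poly);
//		hi.lo ^= mi.hi ^ t.lo;
//		hi.hi ^= mi.lo ^ t.hi;
//		return hi;
//	}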

// GHASH-multiply the 128-bit lanes of \a by the 128-bit lanes of \b and store
// the reduced products in \dst.  See _ghash_mul_step for the full explanation.
.macro	_ghash_mul	a, b, dst, gfpoly, t0, t1, t2
.irp i, 0,1,2,3,4,5,6,7,8,9
	_ghash_mul_step	\i, \a, \b, \dst, \gfpoly, \t0, \t1, \t2
.endr
.endm

// GHASH-multiply the 128-bit lanes of \a by the 128-bit lanes of \b and add the
// *unreduced* products to \lo, \mi, and \hi.
.macro	_ghash_mul_noreduce	a, b, lo, mi, hi, t0, t1, t2, t3
	vpclmulqdq	$0x00, \a, \b, \t0	// a_L * b_L
	vpclmulqdq	$0x01, \a, \b, \t1	// a_L * b_H
	vpclmulqdq	$0x10, \a, \b, \t2	// a_H * b_L
	vpclmulqdq	$0x11, \a, \b, \t3	// a_H * b_H
	vpxord		\t0, \lo, \lo
	vpternlogd	$0x96, \t2, \t1, \mi
	vpxord		\t3, \hi, \hi
.endm

// Reduce the unreduced products from \lo, \mi, and \hi and store the 128-bit
// reduced products in \hi.  See _ghash_mul_step for explanation of the method.
.macro	_ghash_reduce	lo, mi, hi, gfpoly, t0
	vpclmulqdq	$0x01, \lo, \gfpoly, \t0
	vpshufd		$0x4e, \lo, \lo
	vpternlogd	$0x96, \t0, \lo, \mi
	vpclmulqdq	$0x01, \mi, \gfpoly, \t0
	vpshufd		$0x4e, \mi, \mi
	vpternlogd	$0x96, \t0, \mi, \hi
.endm

// void aes_gcm_precompute_##suffix(struct aes_gcm_key_avx10 *key);
//
// Given the expanded AES key |key->aes_key|, this function derives the GHASH
// subkey and initializes |key->ghash_key_powers| with powers of it.
//
// The number of key powers initialized is NUM_H_POWERS, and they are stored in
// the order H^NUM_H_POWERS to H^1.  The zeroized padding blocks after the key
// powers themselves are also initialized.
//
// This macro supports both VL=32 and VL=64.  _set_veclen must have been invoked
// with the desired length.  In the VL=32 case, the function computes twice as
// many key powers as are actually used by the VL=32 GCM update functions.
// This is done to keep the key format the same regardless of vector length.
.macro	_aes_gcm_precompute

	// Function arguments
	.set	KEY,		%rdi

	// Additional local variables.  V0-V2 and %rax are used as temporaries.
	.set	POWERS_PTR,	%rsi
	.set	RNDKEYLAST_PTR,	%rdx
	.set	H_CUR,		V3
	.set	H_CUR_YMM,	%ymm3
	.set	H_CUR_XMM,	%xmm3
	.set	H_INC,		V4
	.set	H_INC_YMM,	%ymm4
	.set	H_INC_XMM,	%xmm4
	.set	GFPOLY,		V5
	.set	GFPOLY_YMM,	%ymm5
	.set	GFPOLY_XMM,	%xmm5

	// Get pointer to lowest set of key powers (located at end of array).
	lea		OFFSETOFEND_H_POWERS-VL(KEY), POWERS_PTR

	// Encrypt an all-zeroes block to get the raw hash subkey.
	movl		OFFSETOF_AESKEYLEN(KEY), %eax
	lea		6*16(KEY,%rax,4), RNDKEYLAST_PTR
	vmovdqu		(KEY), %xmm0	// Zero-th round key XOR all-zeroes block
	add		$16, KEY
1:
	vaesenc		(KEY), %xmm0, %xmm0
	add		$16, KEY
	cmp		KEY, RNDKEYLAST_PTR
	jne		1b
	vaesenclast	(RNDKEYLAST_PTR), %xmm0, %xmm0

	// Reflect the bytes of the raw hash subkey.
	vpshufb		.Lbswap_mask(%rip), %xmm0, H_CUR_XMM

	// Zeroize the padding blocks.
	vpxor		%xmm0, %xmm0, %xmm0
	vmovdqu		%ymm0, VL(POWERS_PTR)
	vmovdqu		%xmm0, VL+2*16(POWERS_PTR)

	// Finish preprocessing the first key power, H^1.  Since this GHASH
	// implementation operates directly on values with the backwards bit
	// order specified by the GCM standard, it's necessary to preprocess the
	// raw key as follows.  First, reflect its bytes.  Second, multiply it
	// by x^-1 mod x^128 + x^7 + x^2 + x + 1 (if using the backwards
	// interpretation of polynomial coefficients), which can also be
	// interpreted as multiplication by x mod x^128 + x^127 + x^126 + x^121
	// + 1 using the alternative, natural interpretation of polynomial
	// coefficients.  For details, see the comment above _ghash_mul_step.
	//
	// Either way, for the multiplication the concrete operation performed
	// is a left shift of the 128-bit value by 1 bit, then an XOR with (0xc2
	// << 120) | 1 if a 1 bit was carried out.  However, there's no 128-bit
	// wide shift instruction, so instead double each of the two 64-bit
	// halves and incorporate the internal carry bit into the value XOR'd.
	vpshufd		$0xd3, H_CUR_XMM, %xmm0
	vpsrad		$31, %xmm0, %xmm0
	vpaddq		H_CUR_XMM, H_CUR_XMM, H_CUR_XMM
	vpand		.Lgfpoly_and_internal_carrybit(%rip), %xmm0, %xmm0
	vpxor		%xmm0, H_CUR_XMM, H_CUR_XMM
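
	// The above five instructions can be modeled in C roughly as below.
	// This is only an illustrative sketch (the function name is made up);
	// h[0] and h[1] are the low and high 64-bit halves of the
	// byte-reflected raw hash subkey:
	//
	//	static void gcm_mul_key_by_x(u64 h[2])
	//	{
	//		u64 carry_lo = h[0] >> 63;	// internal carry
	//		u64 carry_out = h[1] >> 63;	// carried out of bit 127
	//
	//		h[0] = (h[0] << 1) ^ (carry_out ? 1 : 0);
	//		h[1] = (h[1] << 1) ^ carry_lo ^
	//		       (carry_out ? 0xc200000000000000ULL : 0);
	//	}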

	// Load the gfpoly constant.
	vbroadcasti32x4	.Lgfpoly(%rip), GFPOLY

	// Square H^1 to get H^2.
	//
	// Note that as with H^1, all higher key powers also need an extra
	// factor of x^-1 (or x using the natural interpretation).  Nothing
	// special needs to be done to make this happen, though: H^1 * H^1 would
	// end up with two factors of x^-1, but the multiplication consumes one.
	// So the product H^2 ends up with the desired one factor of x^-1.
	_ghash_mul	H_CUR_XMM, H_CUR_XMM, H_INC_XMM, GFPOLY_XMM, \
			%xmm0, %xmm1, %xmm2

	// Create H_CUR_YMM = [H^2, H^1] and H_INC_YMM = [H^2, H^2].
	vinserti128	$1, H_CUR_XMM, H_INC_YMM, H_CUR_YMM
	vinserti128	$1, H_INC_XMM, H_INC_YMM, H_INC_YMM

.if VL == 64
	// Create H_CUR = [H^4, H^3, H^2, H^1] and H_INC = [H^4, H^4, H^4, H^4].
	_ghash_mul	H_INC_YMM, H_CUR_YMM, H_INC_YMM, GFPOLY_YMM, \
			%ymm0, %ymm1, %ymm2
	vinserti64x4	$1, H_CUR_YMM, H_INC, H_CUR
	vshufi64x2	$0, H_INC, H_INC, H_INC
.endif

	// Store the lowest set of key powers.
	vmovdqu8	H_CUR, (POWERS_PTR)

	// Compute and store the remaining key powers.  With VL=32, repeatedly
	// multiply [H^(i+1), H^i] by [H^2, H^2] to get [H^(i+3), H^(i+2)].
	// With VL=64, repeatedly multiply [H^(i+3), H^(i+2), H^(i+1), H^i] by
	// [H^4, H^4, H^4, H^4] to get [H^(i+7), H^(i+6), H^(i+5), H^(i+4)].
	mov		$(NUM_H_POWERS*16/VL) - 1, %eax
.Lprecompute_next\@:
	sub		$VL, POWERS_PTR
	_ghash_mul	H_INC, H_CUR, H_CUR, GFPOLY, V0, V1, V2
	vmovdqu8	H_CUR, (POWERS_PTR)
	dec		%eax
	jnz		.Lprecompute_next\@

	vzeroupper	// This is needed after using ymm or zmm registers.
	RET
.endm

// XOR together the 128-bit lanes of \src (whose low lane is \src_xmm) and store
// the result in \dst_xmm.  This implicitly zeroizes the other lanes of \dst.
.macro	_horizontal_xor	src, src_xmm, dst_xmm, t0_xmm, t1_xmm, t2_xmm
	vextracti32x4	$1, \src, \t0_xmm
.if VL == 32
	vpxord		\t0_xmm, \src_xmm, \dst_xmm
.elseif VL == 64
	vextracti32x4	$2, \src, \t1_xmm
	vextracti32x4	$3, \src, \t2_xmm
	vpxord		\t0_xmm, \src_xmm, \dst_xmm
	vpternlogd	$0x96, \t1_xmm, \t2_xmm, \dst_xmm
.else
	.error "Unsupported vector length"
.endif
.endm
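
// In terms of the ghash_mul() C model sketched earlier in this file, one full
// pass (steps 0 through 9) of the _ghash_step_4x macro defined below computes,
// per 128-bit lane position and before the final horizontal XOR, roughly the
// following (illustrative sketch only; the assembly actually sums the
// *unreduced* products and reduces once, which gives the same result because
// the reduction is linear, and it assumes the data blocks have already been
// byte-reflected):
//
//	static struct u128 ghash_update_4x(struct u128 acc,
//					   const struct u128 h[4], // H_POW4..H_POW1
//					   const struct u128 d[4]) // GHASHDATA0..3
//	{
//		struct u128 t = { d[0].lo ^ acc.lo, d[0].hi ^ acc.hi };
//
//		acc = ghash_mul(h[0], t);
//		t = ghash_mul(h[1], d[1]); acc.lo ^= t.lo; acc.hi ^= t.hi;
//		t = ghash_mul(h[2], d[2]); acc.lo ^= t.lo; acc.hi ^= t.hi;
//		t = ghash_mul(h[3], d[3]); acc.lo ^= t.lo; acc.hi ^= t.hi;
//		return acc;
//	}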

// Do one step of the GHASH update of the data blocks given in the vector
// registers GHASHDATA[0-3].  \i specifies the step to do, 0 through 9.  The
// division into steps allows users of this macro to optionally interleave the
// computation with other instructions.  This macro uses the vector register
// GHASH_ACC as input/output; GHASHDATA[0-3] as inputs that are clobbered;
// H_POW[4-1], GFPOLY, and BSWAP_MASK as inputs that aren't clobbered; and
// GHASHTMP[0-2] as temporaries.  This macro handles the byte-reflection of the
// data blocks.  The parameter registers must be preserved across steps.
//
// The GHASH update does: GHASH_ACC = H_POW4*(GHASHDATA0 + GHASH_ACC) +
// H_POW3*GHASHDATA1 + H_POW2*GHASHDATA2 + H_POW1*GHASHDATA3, where the
// operations are vectorized operations on vectors of 16-byte blocks.  E.g.,
// with VL=32 there are 2 blocks per vector and the vectorized terms correspond
// to the following non-vectorized terms:
//
//	H_POW4*(GHASHDATA0 + GHASH_ACC) => H^8*(blk0 + GHASH_ACC_XMM) and H^7*(blk1 + 0)
//	H_POW3*GHASHDATA1 => H^6*blk2 and H^5*blk3
//	H_POW2*GHASHDATA2 => H^4*blk4 and H^3*blk5
//	H_POW1*GHASHDATA3 => H^2*blk6 and H^1*blk7
//
// With VL=64, we use 4 blocks/vector, H^16 through H^1, and blk0 through blk15.
//
// More concretely, this code does:
//   - Do vectorized "schoolbook" multiplications to compute the intermediate
//     256-bit product of each block and its corresponding hash key power.
//     There are 4*VL/16 of these intermediate products.
//   - Sum (XOR) the intermediate 256-bit products across vectors.  This leaves
//     VL/16 256-bit intermediate values.
//   - Do a vectorized reduction of these 256-bit intermediate values to
//     128-bits each.  This leaves VL/16 128-bit intermediate values.
//   - Sum (XOR) these values and store the 128-bit result in GHASH_ACC_XMM.
//
// See _ghash_mul_step for the full explanation of the operations performed for
// each individual finite field multiplication and reduction.
.macro	_ghash_step_4x	i
.if \i == 0
	vpshufb		BSWAP_MASK, GHASHDATA0, GHASHDATA0
	vpxord		GHASH_ACC, GHASHDATA0, GHASHDATA0
	vpshufb		BSWAP_MASK, GHASHDATA1, GHASHDATA1
	vpshufb		BSWAP_MASK, GHASHDATA2, GHASHDATA2
.elseif \i == 1
	vpshufb		BSWAP_MASK, GHASHDATA3, GHASHDATA3
	vpclmulqdq	$0x00, H_POW4, GHASHDATA0, GHASH_ACC	// LO_0
	vpclmulqdq	$0x00, H_POW3, GHASHDATA1, GHASHTMP0	// LO_1
	vpclmulqdq	$0x00, H_POW2, GHASHDATA2, GHASHTMP1	// LO_2
.elseif \i == 2
	vpxord		GHASHTMP0, GHASH_ACC, GHASH_ACC		// sum(LO_{1,0})
	vpclmulqdq	$0x00, H_POW1, GHASHDATA3, GHASHTMP2	// LO_3
	vpternlogd	$0x96, GHASHTMP2, GHASHTMP1, GHASH_ACC	// LO = sum(LO_{3,2,1,0})
	vpclmulqdq	$0x01, H_POW4, GHASHDATA0, GHASHTMP0	// MI_0
.elseif \i == 3
	vpclmulqdq	$0x01, H_POW3, GHASHDATA1, GHASHTMP1	// MI_1
	vpclmulqdq	$0x01, H_POW2, GHASHDATA2, GHASHTMP2	// MI_2
	vpternlogd	$0x96, GHASHTMP2, GHASHTMP1, GHASHTMP0	// sum(MI_{2,1,0})
	vpclmulqdq	$0x01, H_POW1, GHASHDATA3, GHASHTMP1	// MI_3
.elseif \i == 4
	vpclmulqdq	$0x10, H_POW4, GHASHDATA0, GHASHTMP2	// MI_4
	vpternlogd	$0x96, GHASHTMP2, GHASHTMP1, GHASHTMP0	// sum(MI_{4,3,2,1,0})
	vpclmulqdq	$0x10, H_POW3, GHASHDATA1, GHASHTMP1	// MI_5
	vpclmulqdq	$0x10, H_POW2, GHASHDATA2, GHASHTMP2	// MI_6
.elseif \i == 5
	vpternlogd	$0x96, GHASHTMP2, GHASHTMP1, GHASHTMP0	// sum(MI_{6,5,4,...})
	vpclmulqdq	$0x01, GHASH_ACC, GFPOLY, GHASHTMP2	// LO_L*(x^63 + x^62 + x^57)
	vpclmulqdq	$0x10, H_POW1, GHASHDATA3, GHASHTMP1	// MI_7
	vpxord		GHASHTMP1, GHASHTMP0, GHASHTMP0		// MI = sum(MI_{7,...,0})
.elseif \i == 6
	vpshufd		$0x4e, GHASH_ACC, GHASH_ACC		// Swap halves of LO
	vpclmulqdq	$0x11, H_POW4, GHASHDATA0, GHASHDATA0	// HI_0
	vpclmulqdq	$0x11, H_POW3, GHASHDATA1, GHASHDATA1	// HI_1
	vpclmulqdq	$0x11, H_POW2, GHASHDATA2, GHASHDATA2	// HI_2
.elseif \i == 7
	vpternlogd	$0x96, GHASHTMP2, GHASH_ACC, GHASHTMP0	// Fold LO into MI
	vpclmulqdq	$0x11, H_POW1, GHASHDATA3, GHASHDATA3	// HI_3
	vpternlogd	$0x96, GHASHDATA2, GHASHDATA1, GHASHDATA0 // sum(HI_{2,1,0})
	vpclmulqdq	$0x01, GHASHTMP0, GFPOLY, GHASHTMP1	// MI_L*(x^63 + x^62 + x^57)
.elseif \i == 8
	vpxord		GHASHDATA3, GHASHDATA0, GHASH_ACC	// HI = sum(HI_{3,2,1,0})
	vpshufd		$0x4e, GHASHTMP0, GHASHTMP0		// Swap halves of MI
	vpternlogd	$0x96, GHASHTMP1, GHASHTMP0, GHASH_ACC	// Fold MI into HI
.elseif \i == 9
	_horizontal_xor	GHASH_ACC, GHASH_ACC_XMM, GHASH_ACC_XMM, \
			GHASHDATA0_XMM, GHASHDATA1_XMM, GHASHDATA2_XMM
.endif
.endm

// Do one non-last round of AES encryption on the counter blocks in V0-V3 using
// the round key that has been broadcast to all 128-bit lanes of \round_key.
.macro	_vaesenc_4x	round_key
	vaesenc		\round_key, V0, V0
	vaesenc		\round_key, V1, V1
	vaesenc		\round_key, V2, V2
	vaesenc		\round_key, V3, V3
.endm

// Start the AES encryption of four vectors of counter blocks.
.macro	_ctr_begin_4x

	// Increment LE_CTR four times to generate four vectors of little-endian
	// counter blocks, swap each to big-endian, and store them in V0-V3.
	vpshufb		BSWAP_MASK, LE_CTR, V0
	vpaddd		LE_CTR_INC, LE_CTR, LE_CTR
	vpshufb		BSWAP_MASK, LE_CTR, V1
	vpaddd		LE_CTR_INC, LE_CTR, LE_CTR
	vpshufb		BSWAP_MASK, LE_CTR, V2
	vpaddd		LE_CTR_INC, LE_CTR, LE_CTR
	vpshufb		BSWAP_MASK, LE_CTR, V3
	vpaddd		LE_CTR_INC, LE_CTR, LE_CTR

	// AES "round zero": XOR in the zero-th round key.
	vpxord		RNDKEY0, V0, V0
	vpxord		RNDKEY0, V1, V1
	vpxord		RNDKEY0, V2, V2
	vpxord		RNDKEY0, V3, V3
.endm

// void aes_gcm_{enc,dec}_update_##suffix(const struct aes_gcm_key_avx10 *key,
//					   const u32 le_ctr[4], u8 ghash_acc[16],
//					   const u8 *src, u8 *dst, int datalen);
//
// This macro generates a GCM encryption or decryption update function with the
// above prototype (with \enc selecting which one).  This macro supports both
// VL=32 and VL=64.  _set_veclen must have been invoked with the desired length.
//
// This function computes the next portion of the CTR keystream, XOR's it with
// |datalen| bytes from |src|, and writes the resulting encrypted or decrypted
// data to |dst|.  It also updates the GHASH accumulator |ghash_acc| using the
// next |datalen| ciphertext bytes.
//
// |datalen| must be a multiple of 16, except on the last call where it can be
// any length.  The caller must do any buffering needed to ensure this.  Both
// in-place and out-of-place en/decryption are supported.
//
// |le_ctr| must give the current counter in little-endian format.  For a new
// message, the low word of the counter must be 2.  This function loads the
// counter from |le_ctr| and increments the loaded counter as needed, but it
// does *not* store the updated counter back to |le_ctr|.  The caller must
// update |le_ctr| if any more data segments follow.  Internally, only the low
// 32-bit word of the counter is incremented, following the GCM standard.
.macro	_aes_gcm_update	enc

	// Function arguments
	.set	KEY,		%rdi
	.set	LE_CTR_PTR,	%rsi
	.set	GHASH_ACC_PTR,	%rdx
	.set	SRC,		%rcx
	.set	DST,		%r8
	.set	DATALEN,	%r9d
	.set	DATALEN64,	%r9	// Zero-extend DATALEN before using!

	// Additional local variables

	// %rax and %k1 are used as temporary registers.  LE_CTR_PTR is also
	// available as a temporary register after the counter is loaded.

	// AES key length in bytes
	.set	AESKEYLEN,	%r10d
	.set	AESKEYLEN64,	%r10

	// Pointer to the last AES round key for the chosen AES variant
	.set	RNDKEYLAST_PTR,	%r11

	// In the main loop, V0-V3 are used as AES input and output.  Elsewhere
	// they are used as temporary registers.

	// GHASHDATA[0-3] hold the ciphertext blocks and GHASH input data.
	.set	GHASHDATA0,	V4
	.set	GHASHDATA0_XMM,	%xmm4
	.set	GHASHDATA1,	V5
	.set	GHASHDATA1_XMM,	%xmm5
	.set	GHASHDATA2,	V6
	.set	GHASHDATA2_XMM,	%xmm6
	.set	GHASHDATA3,	V7

	// BSWAP_MASK is the shuffle mask for byte-reflecting 128-bit values
	// using vpshufb, copied to all 128-bit lanes.
	.set	BSWAP_MASK,	V8

	// RNDKEY temporarily holds the next AES round key.
	.set	RNDKEY,		V9

	// GHASH_ACC is the accumulator variable for GHASH.  When fully reduced,
	// only the lowest 128-bit lane can be nonzero.  When not fully reduced,
	// more than one lane may be used, and they need to be XOR'd together.
	.set	GHASH_ACC,	V10
	.set	GHASH_ACC_XMM,	%xmm10

	// LE_CTR_INC is the vector of 32-bit words that need to be added to a
	// vector of little-endian counter blocks to advance it forwards.
	.set	LE_CTR_INC,	V11

	// LE_CTR contains the next set of little-endian counter blocks.
	.set	LE_CTR,		V12

	// RNDKEY0, RNDKEYLAST, and RNDKEY_M[9-5] contain cached AES round keys,
	// copied to all 128-bit lanes.  RNDKEY0 is the zero-th round key,
	// RNDKEYLAST the last, and RNDKEY_M\i the one \i-th from the last.
	.set	RNDKEY0,	V13
	.set	RNDKEYLAST,	V14
	.set	RNDKEY_M9,	V15
	.set	RNDKEY_M8,	V16
	.set	RNDKEY_M7,	V17
	.set	RNDKEY_M6,	V18
	.set	RNDKEY_M5,	V19

	// RNDKEYLAST[0-3] temporarily store the last AES round key XOR'd with
	// the corresponding block of source data.  This is useful because
	// vaesenclast(key, a) ^ b == vaesenclast(key ^ b, a), and key ^ b can
	// be computed in parallel with the AES rounds.
	.set	RNDKEYLAST0,	V20
	.set	RNDKEYLAST1,	V21
	.set	RNDKEYLAST2,	V22
	.set	RNDKEYLAST3,	V23

	// GHASHTMP[0-2] are temporary variables used by _ghash_step_4x.  These
	// cannot coincide with anything used for AES encryption, since for
	// performance reasons GHASH and AES encryption are interleaved.
	.set	GHASHTMP0,	V24
	.set	GHASHTMP1,	V25
	.set	GHASHTMP2,	V26

	// H_POW[4-1] contain the powers of the hash key H^(4*VL/16) ... H^1.
	// The descending numbering reflects the order of the key powers.
	.set	H_POW4,		V27
	.set	H_POW3,		V28
	.set	H_POW2,		V29
	.set	H_POW1,		V30

	// GFPOLY contains the .Lgfpoly constant, copied to all 128-bit lanes.
	.set	GFPOLY,		V31

	// Load some constants.
	vbroadcasti32x4	.Lbswap_mask(%rip), BSWAP_MASK
	vbroadcasti32x4	.Lgfpoly(%rip), GFPOLY

	// Load the GHASH accumulator and the starting counter.
	vmovdqu		(GHASH_ACC_PTR), GHASH_ACC_XMM
	vbroadcasti32x4	(LE_CTR_PTR), LE_CTR

	// Load the AES key length in bytes.
	movl		OFFSETOF_AESKEYLEN(KEY), AESKEYLEN

	// Make RNDKEYLAST_PTR point to the last AES round key.  This is the
	// round key with index 10, 12, or 14 for AES-128, AES-192, or AES-256
	// respectively.  Then load the zero-th and last round keys.
	lea		6*16(KEY,AESKEYLEN64,4), RNDKEYLAST_PTR
	vbroadcasti32x4	(KEY), RNDKEY0
	vbroadcasti32x4	(RNDKEYLAST_PTR), RNDKEYLAST
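
	// The lea above derives the last round key's offset from the AES key
	// length: a 16, 24, or 32 byte key uses 10, 12, or 14 rounds, so the
	// last round key lives at byte offset 16 * (aeskeylen/4 + 6) within the
	// expanded key.  An illustrative C equivalent (hypothetical helper):
	//
	//	static const u8 *last_round_key(const u8 *aes_key, int aeskeylen)
	//	{
	//		return aes_key + 16 * (aeskeylen / 4 + 6);
	//	}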

	// Finish initializing LE_CTR by adding [0, 1, ...] to its 128-bit lanes.
	vpaddd		.Lctr_pattern(%rip), LE_CTR, LE_CTR

	// Initialize LE_CTR_INC to contain VL/16 in all 128-bit lanes.
.if VL == 32
	vbroadcasti32x4	.Linc_2blocks(%rip), LE_CTR_INC
.elseif VL == 64
	vbroadcasti32x4	.Linc_4blocks(%rip), LE_CTR_INC
.else
	.error "Unsupported vector length"
.endif

	// If there are at least 4*VL bytes of data, then continue into the loop
	// that processes 4*VL bytes of data at a time.  Otherwise skip it.
	//
	// Pre-subtracting 4*VL from DATALEN saves an instruction from the main
	// loop and also ensures that at least one write always occurs to
	// DATALEN, zero-extending it and allowing DATALEN64 to be used later.
	sub		$4*VL, DATALEN
	jl		.Lcrypt_loop_4x_done\@

	// Load powers of the hash key.
	vmovdqu8	OFFSETOFEND_H_POWERS-4*VL(KEY), H_POW4
	vmovdqu8	OFFSETOFEND_H_POWERS-3*VL(KEY), H_POW3
	vmovdqu8	OFFSETOFEND_H_POWERS-2*VL(KEY), H_POW2
	vmovdqu8	OFFSETOFEND_H_POWERS-1*VL(KEY), H_POW1

	// Main loop: en/decrypt and hash 4 vectors at a time.
	//
	// When possible, interleave the AES encryption of the counter blocks
	// with the GHASH update of the ciphertext blocks.  This improves
	// performance on many CPUs because the execution ports used by the AES
	// instructions often differ from those used by the carryless
	// multiplication and other instructions used in GHASH.  For example,
	// many Intel CPUs dispatch vaesenc to ports 0 and 1 and vpclmulqdq to
	// port 5.
	//
	// The interleaving is easiest to do during decryption, since during
	// decryption the ciphertext blocks are immediately available.  For
	// encryption, instead encrypt the first set of blocks, then hash those
	// blocks while encrypting the next set of blocks, repeat that as
	// needed, and finally hash the last set of blocks.

.if \enc
	// Encrypt the first 4 vectors of plaintext blocks.  Leave the resulting
	// ciphertext in GHASHDATA[0-3] for GHASH.
	_ctr_begin_4x
	lea		16(KEY), %rax
1:
	vbroadcasti32x4	(%rax), RNDKEY
	_vaesenc_4x	RNDKEY
	add		$16, %rax
	cmp		%rax, RNDKEYLAST_PTR
	jne		1b
	vpxord		0*VL(SRC), RNDKEYLAST, RNDKEYLAST0
	vpxord		1*VL(SRC), RNDKEYLAST, RNDKEYLAST1
	vpxord		2*VL(SRC), RNDKEYLAST, RNDKEYLAST2
	vpxord		3*VL(SRC), RNDKEYLAST, RNDKEYLAST3
	vaesenclast	RNDKEYLAST0, V0, GHASHDATA0
	vaesenclast	RNDKEYLAST1, V1, GHASHDATA1
	vaesenclast	RNDKEYLAST2, V2, GHASHDATA2
	vaesenclast	RNDKEYLAST3, V3, GHASHDATA3
	vmovdqu8	GHASHDATA0, 0*VL(DST)
	vmovdqu8	GHASHDATA1, 1*VL(DST)
	vmovdqu8	GHASHDATA2, 2*VL(DST)
	vmovdqu8	GHASHDATA3, 3*VL(DST)
	add		$4*VL, SRC
	add		$4*VL, DST
	sub		$4*VL, DATALEN
	jl		.Lghash_last_ciphertext_4x\@
.endif

	// Cache as many additional AES round keys as possible.
.irp i, 9,8,7,6,5
	vbroadcasti32x4	-\i*16(RNDKEYLAST_PTR), RNDKEY_M\i
.endr

.Lcrypt_loop_4x\@:

	// If decrypting, load more ciphertext blocks into GHASHDATA[0-3].  If
	// encrypting, GHASHDATA[0-3] already contain the previous ciphertext.
.if !\enc
	vmovdqu8	0*VL(SRC), GHASHDATA0
	vmovdqu8	1*VL(SRC), GHASHDATA1
	vmovdqu8	2*VL(SRC), GHASHDATA2
	vmovdqu8	3*VL(SRC), GHASHDATA3
.endif

	// Start the AES encryption of the counter blocks.
	_ctr_begin_4x
	cmp		$24, AESKEYLEN
	jl		128f		// AES-128?
	je		192f		// AES-192?
	// AES-256
	vbroadcasti32x4	-13*16(RNDKEYLAST_PTR), RNDKEY
	_vaesenc_4x	RNDKEY
	vbroadcasti32x4	-12*16(RNDKEYLAST_PTR), RNDKEY
	_vaesenc_4x	RNDKEY
192:
	vbroadcasti32x4	-11*16(RNDKEYLAST_PTR), RNDKEY
	_vaesenc_4x	RNDKEY
	vbroadcasti32x4	-10*16(RNDKEYLAST_PTR), RNDKEY
	_vaesenc_4x	RNDKEY
128:

	// XOR the source data with the last round key, saving the result in
	// RNDKEYLAST[0-3].  This reduces latency by taking advantage of the
	// property vaesenclast(key, a) ^ b == vaesenclast(key ^ b, a).
.if \enc
	vpxord		0*VL(SRC), RNDKEYLAST, RNDKEYLAST0
	vpxord		1*VL(SRC), RNDKEYLAST, RNDKEYLAST1
	vpxord		2*VL(SRC), RNDKEYLAST, RNDKEYLAST2
	vpxord		3*VL(SRC), RNDKEYLAST, RNDKEYLAST3
.else
	vpxord		GHASHDATA0, RNDKEYLAST, RNDKEYLAST0
	vpxord		GHASHDATA1, RNDKEYLAST, RNDKEYLAST1
	vpxord		GHASHDATA2, RNDKEYLAST, RNDKEYLAST2
	vpxord		GHASHDATA3, RNDKEYLAST, RNDKEYLAST3
.endif

	// Finish the AES encryption of the counter blocks in V0-V3, interleaved
	// with the GHASH update of the ciphertext blocks in GHASHDATA[0-3].
.irp i, 9,8,7,6,5
	_vaesenc_4x	RNDKEY_M\i
	_ghash_step_4x	(9 - \i)
.endr
.irp i, 4,3,2,1
	vbroadcasti32x4	-\i*16(RNDKEYLAST_PTR), RNDKEY
	_vaesenc_4x	RNDKEY
	_ghash_step_4x	(9 - \i)
.endr
	_ghash_step_4x	9

	// Do the last AES round.  This handles the XOR with the source data
	// too, as per the optimization described above.
	vaesenclast	RNDKEYLAST0, V0, GHASHDATA0
	vaesenclast	RNDKEYLAST1, V1, GHASHDATA1
	vaesenclast	RNDKEYLAST2, V2, GHASHDATA2
	vaesenclast	RNDKEYLAST3, V3, GHASHDATA3

	// Store the en/decrypted data to DST.
	vmovdqu8	GHASHDATA0, 0*VL(DST)
	vmovdqu8	GHASHDATA1, 1*VL(DST)
	vmovdqu8	GHASHDATA2, 2*VL(DST)
	vmovdqu8	GHASHDATA3, 3*VL(DST)

	add		$4*VL, SRC
	add		$4*VL, DST
	sub		$4*VL, DATALEN
	jge		.Lcrypt_loop_4x\@

.if \enc
.Lghash_last_ciphertext_4x\@:
	// Update GHASH with the last set of ciphertext blocks.
.irp i, 0,1,2,3,4,5,6,7,8,9
	_ghash_step_4x	\i
.endr
.endif

.Lcrypt_loop_4x_done\@:

	// Undo the extra subtraction by 4*VL and check whether data remains.
	add		$4*VL, DATALEN
	jz		.Ldone\@

	// The data length isn't a multiple of 4*VL.  Process the remaining data
	// of length 1 <= DATALEN < 4*VL, up to one vector (VL bytes) at a time.
	// Going one vector at a time may seem inefficient compared to having
	// separate code paths for each possible number of vectors remaining.
	// However, using a loop keeps the code size down, and it performs
	// surprisingly well; modern CPUs will start executing the next iteration
	// before the previous one finishes and also predict the number of loop
	// iterations.  For a similar reason, we roll up the AES rounds.
	//
	// On the last iteration, the remaining length may be less than VL.
	// Handle this using masking.
	//
	// Since there are enough key powers available for all remaining data,
	// there is no need to do a GHASH reduction after each iteration.
	// Instead, multiply each remaining block by its corresponding key power
	// and only do a GHASH reduction at the very end.

	// Make POWERS_PTR point to the key powers [H^N, H^(N-1), ...] where N
	// is the number of blocks that remain.
	.set	POWERS_PTR,	LE_CTR_PTR
	mov		DATALEN, %eax
	neg		%rax
	and		$~15, %rax	// -round_up(DATALEN, 16)
	lea		OFFSETOFEND_H_POWERS(KEY,%rax), POWERS_PTR

	// Start collecting the unreduced GHASH intermediate value LO, MI, HI.
	.set	LO, GHASHDATA0
	.set	LO_XMM, GHASHDATA0_XMM
	.set	MI, GHASHDATA1
	.set	MI_XMM, GHASHDATA1_XMM
	.set	HI, GHASHDATA2
	.set	HI_XMM, GHASHDATA2_XMM
	vpxor		LO_XMM, LO_XMM, LO_XMM
	vpxor		MI_XMM, MI_XMM, MI_XMM
	vpxor		HI_XMM, HI_XMM, HI_XMM

.Lcrypt_loop_1x\@:

	// Select the appropriate mask for this iteration: all 1's if
	// DATALEN >= VL, otherwise DATALEN 1's.  Do this branchlessly using the
	// bzhi instruction from BMI2.  (This relies on DATALEN <= 255.)
.if VL < 64
	mov		$-1, %eax
	bzhi		DATALEN, %eax, %eax
	kmovd		%eax, %k1
.else
	mov		$-1, %rax
	bzhi		DATALEN64, %rax, %rax
	kmovq		%rax, %k1
.endif
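
	// In C terms, the mask computed above is simply the following
	// (illustrative sketch only):
	//
	//	static u64 partial_vec_mask(unsigned int datalen, unsigned int vl)
	//	{
	//		// bzhi(-1, n) zeroes all bits at positions >= n; when n
	//		// is at least the operand size it leaves all bits set.
	//		return datalen >= vl ? ~0ULL >> (64 - vl)
	//				     : (1ULL << datalen) - 1;
	//	}
	//
	// i.e. one mask bit per remaining byte, which the masked loads and
	// stores below use to handle a partial final vector.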

	// Encrypt a vector of counter blocks.  This does not need to be masked.
	vpshufb		BSWAP_MASK, LE_CTR, V0
	vpaddd		LE_CTR_INC, LE_CTR, LE_CTR
	vpxord		RNDKEY0, V0, V0
	lea		16(KEY), %rax
1:
	vbroadcasti32x4	(%rax), RNDKEY
	vaesenc		RNDKEY, V0, V0
	add		$16, %rax
	cmp		%rax, RNDKEYLAST_PTR
	jne		1b
	vaesenclast	RNDKEYLAST, V0, V0

	// XOR the data with the appropriate number of keystream bytes.
	vmovdqu8	(SRC), V1{%k1}{z}
	vpxord		V1, V0, V0
	vmovdqu8	V0, (DST){%k1}

	// Update GHASH with the ciphertext block(s), without reducing.
	//
	// In the case of DATALEN < VL, the ciphertext is zero-padded to VL.
	// (If decrypting, it's done by the above masked load.  If encrypting,
	// it's done by the below masked register-to-register move.)  Note that
	// if DATALEN <= VL - 16, there will be additional padding beyond the
	// padding of the last block specified by GHASH itself; i.e., there may
	// be whole block(s) that get processed by the GHASH multiplication and
	// reduction instructions but should not actually be included in the
	// GHASH.  However, any such blocks are all-zeroes, and the values that
	// they're multiplied with are also all-zeroes.  Therefore they just add
	// 0 * 0 = 0 to the final GHASH result, which makes no difference.
	vmovdqu8	(POWERS_PTR), H_POW1
.if \enc
	vmovdqu8	V0, V1{%k1}{z}
.endif
	vpshufb		BSWAP_MASK, V1, V0
	vpxord		GHASH_ACC, V0, V0
	_ghash_mul_noreduce	H_POW1, V0, LO, MI, HI, GHASHDATA3, V1, V2, V3
	vpxor		GHASH_ACC_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM

	add		$VL, POWERS_PTR
	add		$VL, SRC
	add		$VL, DST
	sub		$VL, DATALEN
	jg		.Lcrypt_loop_1x\@

	// Finally, do the GHASH reduction.
	_ghash_reduce	LO, MI, HI, GFPOLY, V0
	_horizontal_xor	HI, HI_XMM, GHASH_ACC_XMM, %xmm0, %xmm1, %xmm2

.Ldone\@:
	// Store the updated GHASH accumulator back to memory.
	vmovdqu		GHASH_ACC_XMM, (GHASH_ACC_PTR)

	vzeroupper	// This is needed after using ymm or zmm registers.
	RET
.endm

// void aes_gcm_enc_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
//				      const u32 le_ctr[4], u8 ghash_acc[16],
//				      u64 total_aadlen, u64 total_datalen);
// bool aes_gcm_dec_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
//				      const u32 le_ctr[4],
//				      const u8 ghash_acc[16],
//				      u64 total_aadlen, u64 total_datalen,
//				      const u8 tag[16], int taglen);
//
// This macro generates one of the above two functions (with \enc selecting
// which one).  Both functions finish computing the GCM authentication tag by
// updating GHASH with the lengths block and encrypting the GHASH accumulator.
// |total_aadlen| and |total_datalen| must be the total length of the additional
// authenticated data and the en/decrypted data in bytes, respectively.
//
// The encryption function then stores the full-length (16-byte) computed
// authentication tag to |ghash_acc|.  The decryption function instead loads the
// expected authentication tag (the one that was transmitted) from the 16-byte
// buffer |tag|, compares the first 4 <= |taglen| <= 16 bytes of it to the
// computed tag in constant time, and returns true if and only if they match.
.macro	_aes_gcm_final	enc

	// Function arguments
	.set	KEY,		%rdi
	.set	LE_CTR_PTR,	%rsi
	.set	GHASH_ACC_PTR,	%rdx
	.set	TOTAL_AADLEN,	%rcx
	.set	TOTAL_DATALEN,	%r8
	.set	TAG,		%r9
	.set	TAGLEN,		%r10d	// Originally at 8(%rsp)

	// Additional local variables.
	// %rax, %xmm0-%xmm3, and %k1 are used as temporary registers.
	.set	AESKEYLEN,	%r11d
	.set	AESKEYLEN64,	%r11
	.set	GFPOLY,		%xmm4
	.set	BSWAP_MASK,	%xmm5
	.set	LE_CTR,		%xmm6
	.set	GHASH_ACC,	%xmm7
	.set	H_POW1,		%xmm8

	// Load some constants.
	vmovdqa		.Lgfpoly(%rip), GFPOLY
	vmovdqa		.Lbswap_mask(%rip), BSWAP_MASK

	// Load the AES key length in bytes.
	movl		OFFSETOF_AESKEYLEN(KEY), AESKEYLEN

	// Set up a counter block with 1 in the low 32-bit word.  This is the
	// counter that produces the ciphertext needed to encrypt the auth tag.
	// GFPOLY has 1 in the low word, so grab the 1 from there.
	vpblendd	$0xe, (LE_CTR_PTR), GFPOLY, LE_CTR

	// Build the lengths block and XOR it with the GHASH accumulator.
	// Although the lengths block is defined as the AAD length followed by
	// the en/decrypted data length, both in big-endian byte order, a byte
	// reflection of the full block is needed because of the way we compute
	// GHASH (see _ghash_mul_step).  By using little-endian values in the
	// opposite order, we avoid having to reflect any bytes here.
	vmovq		TOTAL_DATALEN, %xmm0
	vpinsrq		$1, TOTAL_AADLEN, %xmm0, %xmm0
	vpsllq		$3, %xmm0, %xmm0	// Convert bytes to bits
	vpxor		(GHASH_ACC_PTR), %xmm0, GHASH_ACC

	// Load the first hash key power (H^1), which is stored last.
	vmovdqu8	OFFSETOFEND_H_POWERS-16(KEY), H_POW1

.if !\enc
	// Prepare a mask of TAGLEN one bits.
	movl		8(%rsp), TAGLEN
	mov		$-1, %eax
	bzhi		TAGLEN, %eax, %eax
	kmovd		%eax, %k1
.endif

	// Make %rax point to the last AES round key for the chosen AES variant.
	lea		6*16(KEY,AESKEYLEN64,4), %rax

	// Start the AES encryption of the counter block by swapping the counter
	// block to big-endian and XOR-ing it with the zero-th AES round key.
	vpshufb		BSWAP_MASK, LE_CTR, %xmm0
	vpxor		(KEY), %xmm0, %xmm0

	// Complete the AES encryption and multiply GHASH_ACC by H^1.
	// Interleave the AES and GHASH instructions to improve performance.
	cmp		$24, AESKEYLEN
	jl		128f		// AES-128?
	je		192f		// AES-192?
	// AES-256
	vaesenc		-13*16(%rax), %xmm0, %xmm0
	vaesenc		-12*16(%rax), %xmm0, %xmm0
192:
	vaesenc		-11*16(%rax), %xmm0, %xmm0
	vaesenc		-10*16(%rax), %xmm0, %xmm0
128:
.irp i, 0,1,2,3,4,5,6,7,8
	_ghash_mul_step	\i, H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
			%xmm1, %xmm2, %xmm3
	vaesenc		(\i-9)*16(%rax), %xmm0, %xmm0
.endr
	_ghash_mul_step	9, H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
			%xmm1, %xmm2, %xmm3

	// Undo the byte reflection of the GHASH accumulator.
	vpshufb		BSWAP_MASK, GHASH_ACC, GHASH_ACC

	// Do the last AES round and XOR the resulting keystream block with the
	// GHASH accumulator to produce the final authentication tag.
	//
	// Reduce latency by taking advantage of the property vaesenclast(key,
	// a) ^ b == vaesenclast(key ^ b, a).  I.e., XOR GHASH_ACC into the last
	// round key, instead of XOR'ing the final AES output with GHASH_ACC.
	//
	// enc_final then returns the computed auth tag, while dec_final
	// compares it with the transmitted one and returns a bool.  To compare
	// the tags, dec_final XORs them together and uses vptest to check
	// whether the result is all-zeroes.  This should be constant-time.
	// dec_final applies the vaesenclast optimization to this additional
	// value XOR'd too, using vpternlogd to XOR the last round key, GHASH
	// accumulator, and transmitted auth tag together in one instruction.
.if \enc
	vpxor		(%rax), GHASH_ACC, %xmm1
	vaesenclast	%xmm1, %xmm0, GHASH_ACC
	vmovdqu		GHASH_ACC, (GHASH_ACC_PTR)
.else
	vmovdqu		(TAG), %xmm1
	vpternlogd	$0x96, (%rax), GHASH_ACC, %xmm1
	vaesenclast	%xmm1, %xmm0, %xmm0
	xor		%eax, %eax
	vmovdqu8	%xmm0, %xmm0{%k1}{z}	// Keep only the TAGLEN bytes being compared
	vptest		%xmm0, %xmm0
	sete		%al
.endif
	// No need for vzeroupper here, since only xmm registers were used.
	RET
.endm

_set_veclen 32
SYM_FUNC_START(aes_gcm_precompute_vaes_avx10_256)
	_aes_gcm_precompute
SYM_FUNC_END(aes_gcm_precompute_vaes_avx10_256)
SYM_FUNC_START(aes_gcm_enc_update_vaes_avx10_256)
	_aes_gcm_update	1
SYM_FUNC_END(aes_gcm_enc_update_vaes_avx10_256)
SYM_FUNC_START(aes_gcm_dec_update_vaes_avx10_256)
	_aes_gcm_update	0
SYM_FUNC_END(aes_gcm_dec_update_vaes_avx10_256)

_set_veclen 64
SYM_FUNC_START(aes_gcm_precompute_vaes_avx10_512)
	_aes_gcm_precompute
SYM_FUNC_END(aes_gcm_precompute_vaes_avx10_512)
SYM_FUNC_START(aes_gcm_enc_update_vaes_avx10_512)
	_aes_gcm_update	1
SYM_FUNC_END(aes_gcm_enc_update_vaes_avx10_512)
SYM_FUNC_START(aes_gcm_dec_update_vaes_avx10_512)
	_aes_gcm_update	0
SYM_FUNC_END(aes_gcm_dec_update_vaes_avx10_512)

// void aes_gcm_aad_update_vaes_avx10(const struct aes_gcm_key_avx10 *key,
//				       u8 ghash_acc[16],
//				       const u8 *aad, int aadlen);
//
// This function processes the AAD (Additional Authenticated Data) in GCM.
// Using the key |key|, it updates the GHASH accumulator |ghash_acc| with the
// data given by |aad| and |aadlen|.  |key->ghash_key_powers| must have been
// initialized.  On the first call, |ghash_acc| must be all zeroes.  |aadlen|
// must be a multiple of 16, except on the last call where it can be any length.
// The caller must do any buffering needed to ensure this.
//
// AES-GCM is almost always used with small amounts of AAD, often 0 to 32 bytes.
// Therefore, for AAD processing we currently only provide this implementation
// which uses 256-bit vectors (ymm registers) and handles 32 bytes at a time.  This
// keeps the code size down, and it enables some micro-optimizations, e.g. using
// VEX-coded instructions instead of EVEX-coded, which saves some instruction bytes.
// To optimize for large amounts of AAD, we could implement a 4x-wide loop and
// provide a version using 512-bit vectors, but that doesn't seem to be worthwhile.
SYM_FUNC_START(aes_gcm_aad_update_vaes_avx10)

	// Function arguments
	.set	KEY,		%rdi
	.set	GHASH_ACC_PTR,	%rsi
	.set	AAD,		%rdx
	.set	AADLEN,		%ecx
	.set	AADLEN64,	%rcx	// Zero-extend AADLEN before using!

	// Additional local variables.
	// %rax, %ymm0-%ymm3, and %k1 are used as temporary registers.
	.set	BSWAP_MASK,	%ymm4
	.set	GFPOLY,		%ymm5
	.set	GHASH_ACC,	%ymm6
	.set	GHASH_ACC_XMM,	%xmm6
	.set	H_POW1,		%ymm7

	// Load some constants.
	vbroadcasti128	.Lbswap_mask(%rip), BSWAP_MASK
	vbroadcasti128	.Lgfpoly(%rip), GFPOLY

	// Load the GHASH accumulator.
	vmovdqu		(GHASH_ACC_PTR), GHASH_ACC_XMM

	// Update GHASH with 32 bytes of AAD at a time.
	//
	// Pre-subtracting 32 from AADLEN saves an instruction from the loop and
	// also ensures that at least one write always occurs to AADLEN,
	// zero-extending it and allowing AADLEN64 to be used later.
	sub		$32, AADLEN
	jl		.Laad_loop_1x_done
	vmovdqu8	OFFSETOFEND_H_POWERS-32(KEY), H_POW1	// [H^2, H^1]
.Laad_loop_1x:
	vmovdqu		(AAD), %ymm0
	vpshufb		BSWAP_MASK, %ymm0, %ymm0
	vpxor		%ymm0, GHASH_ACC, GHASH_ACC
	_ghash_mul	H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
			%ymm0, %ymm1, %ymm2
	vextracti128	$1, GHASH_ACC, %xmm0
	vpxor		%xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM
	add		$32, AAD
	sub		$32, AADLEN
	jge		.Laad_loop_1x
.Laad_loop_1x_done:
	add		$32, AADLEN
	jz		.Laad_done

	// Update GHASH with the remaining 1 <= AADLEN < 32 bytes of AAD.
	mov		$-1, %eax
	bzhi		AADLEN, %eax, %eax
	kmovd		%eax, %k1
	vmovdqu8	(AAD), %ymm0{%k1}{z}
	neg		AADLEN64
	and		$~15, AADLEN64	// -round_up(AADLEN, 16)
	vmovdqu8	OFFSETOFEND_H_POWERS(KEY,AADLEN64), H_POW1
	vpshufb		BSWAP_MASK, %ymm0, %ymm0
	vpxor		%ymm0, GHASH_ACC, GHASH_ACC
	_ghash_mul	H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
			%ymm0, %ymm1, %ymm2
	vextracti128	$1, GHASH_ACC, %xmm0
	vpxor		%xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM

.Laad_done:
	// Store the updated GHASH accumulator back to memory.
	vmovdqu		GHASH_ACC_XMM, (GHASH_ACC_PTR)

	vzeroupper	// This is needed after using ymm or zmm registers.
	RET
SYM_FUNC_END(aes_gcm_aad_update_vaes_avx10)

SYM_FUNC_START(aes_gcm_enc_final_vaes_avx10)
	_aes_gcm_final	1
SYM_FUNC_END(aes_gcm_enc_final_vaes_avx10)
SYM_FUNC_START(aes_gcm_dec_final_vaes_avx10)
	_aes_gcm_final	0
SYM_FUNC_END(aes_gcm_dec_final_vaes_avx10)
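
// Illustrative sketch of how a caller is expected to drive these functions for
// one encryption operation, based on the prototypes and conventions documented
// above.  The real callers live in the kernel's C glue code; error handling,
// kernel_fpu_begin/end, and buffering of partial 16-byte blocks are omitted,
// and key->aes_key / key->aes_keylen are assumed to already be initialized:
//
//	u8 ghash_acc[16] = {};	// must start as all zeroes
//	u32 le_ctr[4];		// from the IV; low word must be 2 for a new message
//
//	aes_gcm_precompute_vaes_avx10_512(key);		// once per key
//	aes_gcm_aad_update_vaes_avx10(key, ghash_acc, aad, aadlen);
//	aes_gcm_enc_update_vaes_avx10_512(key, le_ctr, ghash_acc,
//					  src, dst, datalen);
//	// ... update le_ctr and call enc_update again if more data follows ...
//	aes_gcm_enc_final_vaes_avx10(key, le_ctr, ghash_acc,
//				     total_aadlen, total_datalen);
//	// ghash_acc now holds the 16-byte authentication tag.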