mirror of https://github.com/git/git.git synced 2024-11-18 06:54:55 +01:00

284 lines

8.4 KiB

C

Add new optimized C 'block-sha1' routines Based on the mozilla SHA1 routine, but doing the input data accesses a word at a time and with 'htonl()' instead of loading bytes and shifting. It requires an architecture that is ok with unaligned 32-bit loads and a fast htonl(). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-06 01:13:20 +02:00			`/*`
remove ARM and Mozilla SHA1 implementations They are both slower than the new BLK_SHA1 implementation, so it is pointless to keep them around. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-18 02:09:56 +02:00			`* SHA1 routine optimized to do word accesses rather than byte accesses,`
			`* and to avoid unnecessary copies into the context array.`
			`*`
			`* This was initially based on the Mozilla SHA1 implementation, although`
			`* none of the original Mozilla code remains.`
			`*/`

make sure byte swapping is optimal for git We rely on ntohl() and htonl() to perform byte swapping in many places. However, some platforms have libraries providing really poor implementations of those which might cause significant performance issues, especially with the block-sha1 code. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-18 21:26:55 +02:00			`/* this is only to get definitions for memcpy(), ntohl() and htonl() */`
			`#include "../git-compat-util.h"`
			`#include "sha1.h"`

block-sha1: guard gcc extensions with __GNUC__ With this, the code should now be portable to any C compiler. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-18 21:37:22 +02:00			`#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))`
block-sha1: try to use rol/ror appropriately Use the one with the smaller constant. It _can_ generate slightly smaller code (a constant of 1 is special), but perhaps more importantly it's possibly faster on any uarch that does a rotate with a loop. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-06 04:42:15 +02:00
block-sha1: split the different "hacks" to be individually selected This is to make it easier for them to be selected individually depending on the architecture instead of the other way around i.e. having each architecture select a list of hacks up front. That makes for clearer documentation as well. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-12 21:46:41 +02:00			`/*`
			`* Force usage of rol or ror by selecting the one with the smaller constant.`
			`* It _can_ generate slightly smaller code (a constant of 1 is special), but`
			`* perhaps more importantly it's possibly faster on any uarch that does a`
			`* rotate with a loop.`
			`*/`

block-sha1: minor fixups Bert Wesarg noticed non-x86 version of SHA_ROT() had a typo. Also spell in-line assembly as __asm__(), otherwise I seem to get error: implicit declaration of function 'asm' from my compiler. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-06 22:52:58 +02:00			`#define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })`
			`#define SHA_ROL(x,n) SHA_ASM("rol", x, n)`
			`#define SHA_ROR(x,n) SHA_ASM("ror", x, n)`

			`#else`

			`#define SHA_ROT(X,l,r) (((X) << (l)) | ((X) >> (r)))`
			`#define SHA_ROL(X,n) SHA_ROT(X,n,32-(n))`
			`#define SHA_ROR(X,n) SHA_ROT(X,32-(n),n)`

			`#endif`
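The non-x86 branch above builds both rotates from one shift-and-or macro. A minimal sketch of that portable form follows; the names `rol32` and `ror32` are illustrative and do not appear in this file, and both helpers assume a rotate count strictly between 0 and 32, as the macros do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not part of block-sha1): portable 32-bit rotates.
 * A count of 0 or 32 would shift by the full width, which is undefined. */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

static inline uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}
```

Rotating left by `n` gives the same result as rotating right by `32 - n`, which is why the x86 branch is free to pick whichever instruction takes the smaller constant.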
block-sha1: improve code on large-register-set machines For x86 performance (especially in 32-bit mode) I added that hack to write the SHA1 internal temporary hash using a volatile pointer, in order to get gcc to not try to cache the array contents. Because gcc will do all the wrong things, and then spill things in insane random ways. But on architectures like PPC, where you have 32 registers, it's actually perfectly reasonable to put the whole temporary array[] into the register set, and gcc can do so. So make the 'volatile unsigned int *' cast be dependent on a SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just x86 and x86-64. With that, the routine is fairly reasonable even when compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on a G5: * Paulus asm version: about 3.67s * Yours with no change: about 5.74s * Yours without "volatile": about 3.78s so with this the C version is within about 3% of the asm one. And add a lot of commentary on what the heck is going on. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-11 01:52:07 +02:00			`/*`
			`* If you have 32 registers or more, the compiler can (and should)`
			`* try to change the array[] accesses into registers. However, on`
			`* machines with less than ~25 registers, that won't really work,`
			`* and at least gcc will make an unholy mess of it.`
			`*`
			`* So to avoid that mess which just slows things down, we force`
			`* the stores to memory to actually happen (we might be better off`
			`* with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as`
			`* suggested by Artur Skawina - that will also make gcc unable to`
			`* try to do the silly "optimize away loads" part because it won't`
			`* see what the value will be).`
			`*`
			`* Ben Herrenschmidt reports that on PPC, the C version comes close`
			`* to the optimized asm with this (ie on PPC you don't want that`
			`* 'volatile', since there are lots of registers).`
			`*`
			`* On ARM we get the best code generation by forcing a full memory barrier`
			`* between each SHA_ROUND, otherwise gcc happily goes wild with spilling,`
			`* the stack frame size simply explodes, and performance goes down the drain.`
			`*/`
			`#if defined(__i386__) || defined(__x86_64__)`
			`#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))`
			`#elif defined(__GNUC__) && defined(__arm__)`
			`#define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)`
			`#else`
			`#define setW(x, val) (W(x) = (val))`
			`#endif`
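The setW() variants above differ only in how strongly they push the compiler to perform the store. As a rough sketch, the volatile-cast form can be exercised on its own; `setW_volatile`, `store_and_reload` and the local `array` are illustrative names, and `uint32_t` stands in for the file's `unsigned int`.

```c
#include <assert.h>
#include <stdint.h>

#define W(x) (array[(x) & 15])  /* 16-word circular buffer, as in the listing */

/* Store through a volatile-qualified lvalue. Compilers such as gcc treat
 * this as a write that has to reach memory rather than stay in a register. */
#define setW_volatile(x, val) (*(volatile uint32_t *)&W(x) = (val))

/* Hypothetical helper: write one slot and read it back. */
static uint32_t store_and_reload(unsigned index, uint32_t val)
{
    uint32_t array[16] = { 0 };

    assert(index < 80);         /* SHA-1 uses t = 0..79 */
    setW_volatile(index, val);
    return W(index);
}
```

The value read back is the same whichever variant is used; only the generated code differs, so any speed-up has to be measured per compiler and target.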
block-sha1: macroize the rounds a bit further Avoid repeating the shared parts of the different rounds by adding a macro layer or two. It was already more cpp than C. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-06 16:20:54 +02:00
block-sha1: support for architectures with memory alignment restrictions This is needed on architectures with poor or non-existent unaligned memory support and/or no fast byte swap instruction (such as ARM) by using byte accesses to memory and shifting the result together. This also makes the code portable, therefore the byte access methods are the defaults. Any architecture that properly supports unaligned word accesses in hardware simply has to enable the alternative methods. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-12 21:47:55 +02:00			`/*`
			`* Performance might be improved if the CPU architecture is OK with`
			`* unaligned 32-bit loads and a fast ntohl() is available.`
			`* Otherwise fall back to byte loads and shifts which is portable,`
			`* and is faster on architectures with memory alignment issues.`
			`*/`

block-sha1: more good unaligned memory access candidates In addition to X86, PowerPC and S390 are capable of unaligned memory accesses. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-13 06:29:14 +02:00			`#if defined(__i386__) || defined(__x86_64__) || \`
msvc: Select the "fast" definition of the {get,put}_be32() macros On Intel machines, the msvc compiler defines the CPU architecture macros _M_IX86 and _M_X64 (equivalent to __i386__ and __x86_64__ respectively). Use these macros in the pre-processor expression to select the "fast" definition of the {get,put}_be32() macros. Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Acked-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2010-06-23 21:47:02 +02:00			`defined(_M_IX86) || defined(_M_X64) || \`
			`defined(__ppc__) || defined(__ppc64__) || \`
			`defined(__powerpc__) || defined(__powerpc64__) || \`
			`defined(__s390__) || defined(__s390x__)`

			`#define get_be32(p) ntohl(*(unsigned int *)(p))`
			`#define put_be32(p, v) do { *(unsigned int *)(p) = htonl(v); } while (0)`

			`#else`

			`#define get_be32(p) ( \`
			`(*((unsigned char *)(p) + 0) << 24) | \`
			`(*((unsigned char *)(p) + 1) << 16) | \`
			`(*((unsigned char *)(p) + 2) << 8) | \`
			`(*((unsigned char *)(p) + 3) << 0) )`
			`#define put_be32(p, v) do { \`
			`unsigned int __v = (v); \`
			`*((unsigned char *)(p) + 0) = __v >> 24; \`
			`*((unsigned char *)(p) + 1) = __v >> 16; \`
			`*((unsigned char *)(p) + 2) = __v >> 8; \`
			`*((unsigned char *)(p) + 3) = __v >> 0; } while (0)`

			`#endif`
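The portable fallback above reads and writes a big-endian word one byte at a time, so it never depends on the pointer's alignment. A small sketch of the same idea as functions follows; `load_be32`, `store_be32` and `be32_roundtrip` are illustrative names, not part of this file.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: alignment-safe big-endian load. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: alignment-safe big-endian store. */
static void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Store at a possibly misaligned offset, then load the value back. */
static uint32_t be32_roundtrip(uint32_t v, unsigned offset)
{
    unsigned char buf[8] = { 0 };

    assert(offset <= 4);        /* keep all four bytes inside buf */
    store_be32(buf + offset, v);
    return load_be32(buf + offset);
}
```

Because every access goes through `unsigned char`, the compiler has no grounds to assume 32-bit alignment, which is the property the byte-wise macros rely on.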

			`/* This "rolls" over the 512-bit array */`
			`#define W(x) (array[(x)&15])`

			`/*`
			`* Where do we get the source from? The first 16 iterations get it from`
			`* the input data, the next mix it from the 512-bit array.`
			`*/`
block-sha1: put expanded macro parameters in parentheses 't' is currently always a numeric constant, but it can't hurt to prepare for the day that it becomes useful for a caller to pass in a more complex expression. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-07-23 01:40:54 +02:00			`#define SHA_SRC(t) get_be32((unsigned char *) block + (t)*4)`
			`#define SHA_MIX(t) SHA_ROL(W((t)+13) ^ W((t)+8) ^ W((t)+2) ^ W(t), 1);`
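W() and SHA_MIX() together compute the SHA-1 message schedule in a 16-word circular buffer instead of the usual 80-word array. The sketch below checks that the two forms produce the same words; all function names are illustrative, and the input words are arbitrary values derived from a seed.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rol1(uint32_t x) { return (x << 1) | (x >> 31); }

/* Textbook form: expand 16 input words into all 80 schedule words. */
static void schedule_full(const uint32_t in[16], uint32_t out[80])
{
    int t;

    for (t = 0; t < 16; t++)
        out[t] = in[t];
    for (t = 16; t < 80; t++)
        out[t] = rol1(out[t - 3] ^ out[t - 8] ^ out[t - 14] ^ out[t - 16]);
}

/* Rolling form: same recurrence, but only 16 slots indexed modulo 16.
 * (t+13), (t+8), (t+2) and t are t-3, t-8, t-14 and t-16 modulo 16. */
static uint32_t rolling_word(const uint32_t in[16], int want)
{
    uint32_t w[16];
    int t;

    assert(want >= 0 && want < 80);
    for (t = 0; t < 16; t++)
        w[t] = in[t];
    for (t = 16; t <= want; t++)
        w[t & 15] = rol1(w[(t + 13) & 15] ^ w[(t + 8) & 15] ^
                         w[(t + 2) & 15] ^ w[t & 15]);
    return w[want & 15];
}

/* Returns 1 if both forms agree on every word for this seed. */
static int schedules_agree(uint32_t seed)
{
    uint32_t in[16], full[80];
    int t;

    for (t = 0; t < 16; t++)
        in[t] = seed * (uint32_t)(t + 1) + 0x9e3779b9u;
    schedule_full(in, full);
    for (t = 0; t < 80; t++)
        if (rolling_word(in, t) != full[t])
            return 0;
    return 1;
}
```

The rolling form only works because each new word overwrites the slot holding the word from 16 iterations earlier, which is never needed again.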
block-sha1: perform register rotation using cpp Instead of letting the compiler to figure out the optimal way to rotate register usage, explicitly rotate the register names with cpp. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-06 21:41:00 +02:00			`#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \`
block-sha1: improved SHA1 hashing I think I have found a way to avoid the gcc crazyness. Lookie here: # TIME[s] SPEED[MB/s] rfc3174 5.094 119.8 rfc3174 5.098 119.7 linus 1.462 417.5 linusas 2.008 304 linusas2 1.878 325 mozilla 5.566 109.6 mozillaas 5.866 104.1 openssl 1.609 379.3 spelvin 1.675 364.5 spelvina 1.601 381.3 nettle 1.591 383.6 notice? I outperform all the hand-tuned asm on 32-bit too. By quite a margin, in fact. Now, I didn't try a P4, and it's possible that it won't do that there, but the 32-bit code generation sure looks impressive on my Nehalem box. The magic? I force the stores to the 512-bit hash bucket to be done in order. That seems to help a lot. The diff is trivial (on top of the "rename registers with cpp" patch), as appended. And it does seem to fix the P4 issues too, although I can obviously (once again) only test Prescott, and only in 64-bit mode: # TIME[s] SPEED[MB/s] rfc3174 1.662 36.73 rfc3174 1.64 37.22 linus 0.2523 241.9 linusas 0.4367 139.8 linusas2 0.4487 136 mozilla 0.9704 62.9 mozillaas 0.9399 64.94 that's some really impressive improvement. All from just saying "do the stores in the order I told you to, dammit!" to the compiler. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-08 06:16:46 +02:00			`unsigned int TEMP = input(t); setW(t, TEMP); \`
			`E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \`
			`B = SHA_ROR(B, 2); } while (0)`
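SHA_ROUND() updates only E and B; the usual five-way shuffle of the working variables is replaced by passing the names in rotated order on each call. A sketch under simplified assumptions (round constant and function fixed to the first-round values, five arbitrary input words) compares that scheme with the textbook shuffle; every name in it is illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define K1 0x5a827999u

static uint32_t rol(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* Textbook round: compute TEMP, then shuffle all five variables. */
static void rounds_shuffled(uint32_t s[5], const uint32_t w[5])
{
    int t;

    for (t = 0; t < 5; t++) {
        uint32_t f = ((s[2] ^ s[3]) & s[1]) ^ s[3];
        uint32_t temp = rol(s[0], 5) + f + s[4] + w[t] + K1;
        s[4] = s[3];
        s[3] = s[2];
        s[2] = rol(s[1], 30);
        s[1] = s[0];
        s[0] = temp;
    }
}

/* Renamed round: update E and B in place; the caller rotates the names. */
#define ROUND(wt, A, B, C, D, E) do { \
    E += (wt) + rol(A, 5) + ((((C) ^ (D)) & (B)) ^ (D)) + K1; \
    B = rol(B, 30); } while (0)

static void rounds_renamed(uint32_t s[5], const uint32_t w[5])
{
    uint32_t A = s[0], B = s[1], C = s[2], D = s[3], E = s[4];

    ROUND(w[0], A, B, C, D, E);
    ROUND(w[1], E, A, B, C, D);
    ROUND(w[2], D, E, A, B, C);
    ROUND(w[3], C, D, E, A, B);
    ROUND(w[4], B, C, D, E, A);
    s[0] = A; s[1] = B; s[2] = C; s[3] = D; s[4] = E;
}

/* Returns 1 if five renamed rounds match five shuffled rounds. */
static int rotation_matches_shuffle(void)
{
    uint32_t a[5] = { 0x67452301u, 0xefcdab89u, 0x98badcfeu,
                      0x10325476u, 0xc3d2e1f0u };
    uint32_t b[5];
    const uint32_t w[5] = { 1, 2, 3, 4, 5 };
    int i;

    for (i = 0; i < 5; i++)
        b[i] = a[i];
    rounds_shuffled(a, w);
    rounds_renamed(b, w);
    for (i = 0; i < 5; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

After five calls the names are back in their starting order, which is why the listing's argument pattern repeats every five rounds.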
			`#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )`
			`#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )`
			`#define T_20_39(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1, A, B, C, D, E )`
			`#define T_40_59(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc, A, B, C, D, E )`
			`#define T_60_79(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6, A, B, C, D, E )`
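The round macros above use rewritten forms of the SHA-1 "choose" and "majority" functions. The sketch below sets them beside the forms given in the SHA-1 specification; the function names are illustrative, and the three test words cover all eight combinations of input bits.

```c
#include <assert.h>
#include <stdint.h>

/* Forms used by T_0_15/T_16_19 and T_40_59 in the listing. */
static uint32_t ch_opt(uint32_t b, uint32_t c, uint32_t d)
{
    return ((c ^ d) & b) ^ d;
}

static uint32_t maj_opt(uint32_t b, uint32_t c, uint32_t d)
{
    /* The two terms never share a set bit, so '+' acts like '|'. */
    return (b & c) + (d & (b ^ c));
}

/* Reference forms from the SHA-1 specification. */
static uint32_t ch_ref(uint32_t b, uint32_t c, uint32_t d)
{
    return (b & c) | (~b & d);
}

static uint32_t maj_ref(uint32_t b, uint32_t c, uint32_t d)
{
    return (b & c) | (b & d) | (c & d);
}

/* Hypothetical self-check over one truth-table-covering input. */
static void check_round_functions(void)
{
    const uint32_t b = 0xF0F0F0F0u, c = 0xCCCCCCCCu, d = 0xAAAAAAAAu;

    assert(ch_opt(b, c, d) == ch_ref(b, c, d));
    assert(maj_opt(b, c, d) == maj_ref(b, c, d));
}
```

Using an addition in the majority function presumably gives the compiler more freedom in combining it with the other additions of the round, though that motivation is not stated in the listing.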
block-sha1: avoid pointer conversion that violates alignment constraints With 660231aa (block-sha1: support for architectures with memory alignment restrictions, 2009-08-12), blk_SHA1_Update was modified to access 32-bit chunks of memory one byte at a time on arches that prefer that: #define get_be32(p) ( \ (*((unsigned char *)(p) + 0) << 24) | \ (*((unsigned char *)(p) + 1) << 16) | \ (*((unsigned char *)(p) + 2) << 8) | \ (*((unsigned char *)(p) + 3) << 0) ) The code previously accessed these values by just using htonl(*p). Unfortunately, Michael noticed on an Alpha machine that git was using plain 32-bit reads anyway. As soon as we convert a pointer to int *, the compiler can assume that the object pointed to is correctly aligned as an int (C99 section 6.3.2.3 "pointer conversions" paragraph 7), and gcc takes full advantage by using a single 32-bit load, resulting in a whole bunch of unaligned access traps. So we need to obey the alignment constraints even when only dealing with pointers instead of actual values. Do so by changing the type of 'data' to void *. This patch renames 'data' to 'block' at the same time to make sure all references are updated to reflect the new type. Reported-tested-and-explained-by: Michael Cree <mcree@orcon.net.nz> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-07-23 01:39:54 +02:00			`static void blk_SHA1_Block(blk_SHA_CTX *ctx, const void *block)`
			`{`
			`unsigned int A,B,C,D,E;`
block-sha1: re-use the temporary array as we calculate the SHA1 The mozilla-SHA1 code did this 80-word array for the 80 iterations. But the SHA1 state is really just 512 bits, and you can actually keep it in a kind of "circular queue" of just 16 words instead. This requires us to do the xor updates as we go along (rather than as a pre-phase), but that's really what we want to do anyway. This gets me really close to the OpenSSL performance on my Nehalem. Look ma, all C code (ok, there's the rol/ror hack, but that one doesn't strictly even matter on my Nehalem, it's just a local optimization). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-06 05:49:41 +02:00			`unsigned int array[16];`
			`A = ctx->H[0];`
			`B = ctx->H[1];`
			`C = ctx->H[2];`
			`D = ctx->H[3];`
			`E = ctx->H[4];`

			`/* Round 1 - iterations 0-15 take their input from 'block' */`
			`T_0_15( 0, A, B, C, D, E);`
			`T_0_15( 1, E, A, B, C, D);`
			`T_0_15( 2, D, E, A, B, C);`
			`T_0_15( 3, C, D, E, A, B);`
			`T_0_15( 4, B, C, D, E, A);`
			`T_0_15( 5, A, B, C, D, E);`
			`T_0_15( 6, E, A, B, C, D);`
			`T_0_15( 7, D, E, A, B, C);`
			`T_0_15( 8, C, D, E, A, B);`
			`T_0_15( 9, B, C, D, E, A);`
			`T_0_15(10, A, B, C, D, E);`
			`T_0_15(11, E, A, B, C, D);`
			`T_0_15(12, D, E, A, B, C);`
			`T_0_15(13, C, D, E, A, B);`
			`T_0_15(14, B, C, D, E, A);`
			`T_0_15(15, A, B, C, D, E);`
block-sha1: make the 'ntohl()' part of the first SHA1 loop This helps a teeny bit. But what I -really- want to do is to avoid the whole 80-array loop, and do the xor updates as I go along.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-06 05:28:07 +02:00
			`/* Round 1 - tail. Input from 512-bit mixing array */`
			`T_16_19(16, E, A, B, C, D);`
			`T_16_19(17, D, E, A, B, C);`
			`T_16_19(18, C, D, E, A, B);`
			`T_16_19(19, B, C, D, E, A);`

			`/* Round 2 */`
			`T_20_39(20, A, B, C, D, E);`
			`T_20_39(21, E, A, B, C, D);`
			`T_20_39(22, D, E, A, B, C);`
			`T_20_39(23, C, D, E, A, B);`
			`T_20_39(24, B, C, D, E, A);`
			`T_20_39(25, A, B, C, D, E);`
			`T_20_39(26, E, A, B, C, D);`
			`T_20_39(27, D, E, A, B, C);`
			`T_20_39(28, C, D, E, A, B);`
			`T_20_39(29, B, C, D, E, A);`
			`T_20_39(30, A, B, C, D, E);`
			`T_20_39(31, E, A, B, C, D);`
			`T_20_39(32, D, E, A, B, C);`
			`T_20_39(33, C, D, E, A, B);`
			`T_20_39(34, B, C, D, E, A);`
			`T_20_39(35, A, B, C, D, E);`
			`T_20_39(36, E, A, B, C, D);`
			`T_20_39(37, D, E, A, B, C);`
			`T_20_39(38, C, D, E, A, B);`
			`T_20_39(39, B, C, D, E, A);`

			`/* Round 3 */`
			`T_40_59(40, A, B, C, D, E);`
			`T_40_59(41, E, A, B, C, D);`
			`T_40_59(42, D, E, A, B, C);`
			`T_40_59(43, C, D, E, A, B);`
			`T_40_59(44, B, C, D, E, A);`
			`T_40_59(45, A, B, C, D, E);`
			`T_40_59(46, E, A, B, C, D);`
			`T_40_59(47, D, E, A, B, C);`
			`T_40_59(48, C, D, E, A, B);`
			`T_40_59(49, B, C, D, E, A);`
			`T_40_59(50, A, B, C, D, E);`
			`T_40_59(51, E, A, B, C, D);`
			`T_40_59(52, D, E, A, B, C);`
			`T_40_59(53, C, D, E, A, B);`
			`T_40_59(54, B, C, D, E, A);`
			`T_40_59(55, A, B, C, D, E);`
			`T_40_59(56, E, A, B, C, D);`
			`T_40_59(57, D, E, A, B, C);`
			`T_40_59(58, C, D, E, A, B);`
			`T_40_59(59, B, C, D, E, A);`

			`/* Round 4 */`
			`T_60_79(60, A, B, C, D, E);`
			`T_60_79(61, E, A, B, C, D);`
			`T_60_79(62, D, E, A, B, C);`
			`T_60_79(63, C, D, E, A, B);`
			`T_60_79(64, B, C, D, E, A);`
			`T_60_79(65, A, B, C, D, E);`
			`T_60_79(66, E, A, B, C, D);`
			`T_60_79(67, D, E, A, B, C);`
			`T_60_79(68, C, D, E, A, B);`
			`T_60_79(69, B, C, D, E, A);`
			`T_60_79(70, A, B, C, D, E);`
			`T_60_79(71, E, A, B, C, D);`
			`T_60_79(72, D, E, A, B, C);`
			`T_60_79(73, C, D, E, A, B);`
			`T_60_79(74, B, C, D, E, A);`
			`T_60_79(75, A, B, C, D, E);`
			`T_60_79(76, E, A, B, C, D);`
			`T_60_79(77, D, E, A, B, C);`
			`T_60_79(78, C, D, E, A, B);`
			`T_60_79(79, B, C, D, E, A);`
			`ctx->H[0] += A;`
			`ctx->H[1] += B;`
			`ctx->H[2] += C;`
			`ctx->H[3] += D;`
			`ctx->H[4] += E;`
			`}`
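The round calls above rotate the register names (A, B, C, D, E, then E, A, B, C, D, and so on) instead of shuffling values between variables. The following is a minimal sketch of why that is equivalent to the textbook SHA-1 step, assuming each round macro updates only its last argument and rotates its second one. `STEP`, `rol`, `five_steps_rotated`, and `five_steps_textbook` are hypothetical names for illustration, not the macros used in this file, and the message words are passed in directly rather than derived from a block.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rol(uint32_t x, int n)
{
	return (x << n) | (x >> (32 - n));
}

/* Round constant for steps 20..39; the parity function b^c^d is used below. */
#define K_20_39 0x6ed9eba1u

/* Hypothetical round macro: only E and B are written, no values are moved. */
#define STEP(w, A, B, C, D, E) do { \
	E += rol(A, 5) + ((B) ^ (C) ^ (D)) + K_20_39 + (w); \
	B = rol(B, 30); \
} while (0)

/* Five steps with the register names rotated by the preprocessor. */
static void five_steps_rotated(uint32_t s[5], const uint32_t w[5])
{
	uint32_t A = s[0], B = s[1], C = s[2], D = s[3], E = s[4];

	STEP(w[0], A, B, C, D, E);
	STEP(w[1], E, A, B, C, D);
	STEP(w[2], D, E, A, B, C);
	STEP(w[3], C, D, E, A, B);
	STEP(w[4], B, C, D, E, A);

	s[0] = A; s[1] = B; s[2] = C; s[3] = D; s[4] = E;
}

/* The same five steps written the textbook way, moving values each step. */
static void five_steps_textbook(uint32_t s[5], const uint32_t w[5])
{
	uint32_t a = s[0], b = s[1], c = s[2], d = s[3], e = s[4];
	int i;

	for (i = 0; i < 5; i++) {
		uint32_t t = rol(a, 5) + (b ^ c ^ d) + e + K_20_39 + w[i];
		e = d;
		d = c;
		c = rol(b, 30);
		b = a;
		a = t;
	}

	s[0] = a; s[1] = b; s[2] = c; s[3] = d; s[4] = e;
}
```

After five steps the names line up with their original positions again, which is why each group of 20 calls can be written as four repetitions of the same five-line pattern.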
block-sha1: move code around Move the code around so specific architecture hacks are defined first. Also make one line comments actually one line. No code change. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-12 21:45:48 +02:00
			`void blk_SHA1_Init(blk_SHA_CTX *ctx)`
			`{`
			`ctx->size = 0;`

			`/* Initialize H with the magic constants (see FIPS180 for constants) */`
			`ctx->H[0] = 0x67452301;`
			`ctx->H[1] = 0xefcdab89;`
			`ctx->H[2] = 0x98badcfe;`
			`ctx->H[3] = 0x10325476;`
			`ctx->H[4] = 0xc3d2e1f0;`
			`}`

			`void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)`
			`{`
msvc: Fix some compiler warnings In particular, using the normal (or production) compiler warning level (-W3), msvc complains as follows: .../sha1.c(244) : warning C4018: '<' : signed/unsigned mismatch .../sha1.c(270) : warning C4244: 'function' : conversion from \ 'unsigned __int64' to 'unsigned long', possible loss of data .../sha1.c(271) : warning C4244: 'function' : conversion from \ 'unsigned __int64' to 'unsigned long', possible loss of data Note that gcc issues a similar complaint about line 244 when compiling with -Wextra. Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2010-06-23 21:47:50 +02:00			`unsigned int lenW = ctx->size & 63;`
			`ctx->size += len;`

			`/* Read the data into W and process blocks as they get full */`
			`if (lenW) {`
			`unsigned int left = 64 - lenW;`
			`if (len < left)`
			`left = len;`
			`memcpy(lenW + (char *)ctx->W, data, left);`
			`lenW = (lenW + left) & 63;`
			`len -= left;`
block-sha1/sha1.c: silence compiler complaints by casting void * to char * Some compilers produce errors when arithmetic is attempted on pointers to void. We want computations done on byte addresses, so cast them to char * to work them around. Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-15 00:52:15 +02:00			`data = ((const char *)data + left);`
			`if (lenW)`
			`return;`
			`blk_SHA1_Block(ctx, ctx->W);`
			`}`
			`while (len >= 64) {`
			`blk_SHA1_Block(ctx, data);`
			`data = ((const char *)data + 64);`
			`len -= 64;`
			`}`
			`if (len)`
			`memcpy(ctx->W, data, len);`
			`}`
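The buffering in `blk_SHA1_Update` can be exercised in isolation. The following is a minimal sketch, assuming a stand-in context that only counts full blocks instead of hashing them. `demo_ctx`, `demo_block`, and `demo_update` are hypothetical names introduced for illustration; the control flow mirrors the function above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for blk_SHA_CTX: only what the buffering needs. */
struct demo_ctx {
	unsigned long long size; /* total bytes fed so far */
	unsigned char buf[64];   /* partial block, plays the role of ctx->W */
	unsigned int blocks;     /* number of full 64-byte blocks consumed */
};

/* Hypothetical stand-in for blk_SHA1_Block(): counts calls, hashes nothing. */
static void demo_block(struct demo_ctx *c, const void *block)
{
	(void)block;
	c->blocks++;
}

static void demo_update(struct demo_ctx *c, const void *data, unsigned long len)
{
	unsigned int fill = c->size & 63; /* bytes already buffered */

	c->size += len;

	/* First top up a partially filled buffer. */
	if (fill) {
		unsigned int left = 64 - fill;
		if (len < left)
			left = len;
		assert(fill + left <= sizeof(c->buf));
		memcpy(c->buf + fill, data, left);
		fill = (fill + left) & 63;
		len -= left;
		data = (const char *)data + left;
		if (fill)
			return; /* still not a full block */
		demo_block(c, c->buf);
	}

	/* Then consume whole blocks straight from the caller's memory. */
	while (len >= 64) {
		demo_block(c, data);
		data = (const char *)data + 64;
		len -= 64;
	}

	/* Finally stash the tail for the next call. */
	if (len)
		memcpy(c->buf, data, len);
}
```

Feeding 10 bytes, then 54, then 200 to a zero-initialized `demo_ctx` consumes four blocks and leaves 8 bytes buffered. Only the first and last fragments of a call are copied; full blocks in the middle are processed in place.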

			`void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)`
			`{`
			`static const unsigned char pad[64] = { 0x80 };`
			`unsigned int padlen[2];`
			`int i;`

			`/* Pad with a binary 1 (ie 0x80), then zeroes, then length */`
			`padlen[0] = htonl((uint32_t)(ctx->size >> 29));`
			`padlen[1] = htonl((uint32_t)(ctx->size << 3));`
			`i = ctx->size & 63;`
			`blk_SHA1_Update(ctx, pad, 1 + (63 & (55 - i)));`
			`blk_SHA1_Update(ctx, padlen, 8);`

			`/* Output hash */`
			`for (i = 0; i < 5; i++)`
block-sha1: support for architectures with memory alignment restrictions This is needed on architectures with poor or non-existent unaligned memory support and/or no fast byte swap instruction (such as ARM) by using byte accesses to memory and shifting the result together. This also makes the code portable, therefore the byte access methods are the defaults. Any architecture that properly supports unaligned word accesses in hardware simply has to enable the alternative methods. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-12 21:47:55 +02:00			`put_be32(hashout + i*4, ctx->H[i]);`
			`}`
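The padding arithmetic in `blk_SHA1_Final` is compact, so here is a small sketch of what the two expressions compute. `sha1_pad_bytes` and `sha1_length_words` are hypothetical helper names, not functions from this file, and the byte-order conversion done by `htonl()` is left out.

```c
#include <stdint.h>

/*
 * Number of pad bytes (one 0x80 byte followed by zeroes) needed so that,
 * once the 8-byte length field is appended, the total is a multiple of 64.
 * 'fill' is the number of bytes already in the current block (size & 63).
 */
static unsigned int sha1_pad_bytes(unsigned int fill)
{
	return 1 + (63 & (55 - fill));
}

/*
 * Split the message length in bits (size * 8) into two 32-bit words.
 * (size << 3) >> 32 is the same as size >> 29, which gives the high word
 * without needing a wider intermediate.
 */
static void sha1_length_words(uint64_t size, uint32_t *hi, uint32_t *lo)
{
	*hi = (uint32_t)(size >> 29);
	*lo = (uint32_t)(size << 3);
}
```

With 55 bytes buffered, a single 0x80 byte is enough to leave exactly 8 bytes for the length. With 56 bytes buffered the length no longer fits, so 64 pad bytes are added and the length lands at the end of the following block.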