I think I have found a way to avoid the gcc crazyness.
Lookie here:
# TIME[s] SPEED[MB/s]
rfc3174 5.094 119.8
rfc3174 5.098 119.7
linus 1.462 417.5
linusas 2.008 304
linusas2 1.878 325
mozilla 5.566 109.6
mozillaas 5.866 104.1
openssl 1.609 379.3
spelvin 1.675 364.5
spelvina 1.601 381.3
nettle 1.591 383.6
notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
margin, in fact.
Now, I didn't try a P4, and it's possible that it won't do that there, but
the 32-bit code generation sure looks impressive on my Nehalem box. The
magic? I force the stores to the 512-bit hash bucket to be done in order.
That seems to help a lot.
The diff is trivial (on top of the "rename registers with cpp" patch), as
appended. And it does seem to fix the P4 issues too, although I can
obviously (once again) only test Prescott, and only in 64-bit mode:
# TIME[s] SPEED[MB/s]
rfc3174 1.662 36.73
rfc3174 1.64 37.22
linus 0.2523 241.9
linusas 0.4367 139.8
linusas2 0.4487 136
mozilla 0.9704 62.9
mozillaas 0.9399 64.94
that's some really impressive improvement. All from just saying "do the
stores in the order I told you to, dammit!" to the compiler.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead of letting the compiler to figure out the optimal way to rotate
register usage, explicitly rotate the register names with cpp.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
.. and simplify the ctx->size logic.
We now count the size in bytes, which means that 'lenW' was always just
the low 6 bits of the total size, so we don't carry it around separately
any more. And we do the 'size in bits' shift at the end.
Suggested by Nicolas Pitre and linux@horizon.com.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It's an equivalent expression, but the '+' gives us some freedom in
instruction selection (for example, we can use 'lea' rather than 'add'),
and associates with the other additions around it to give some minor
scheduling freedom.
Suggested-by: linux@horizon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Avoid repeating the shared parts of the different rounds by adding a
macro layer or two. It was already more cpp than C.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The mozilla-SHA1 code did this 80-word array for the 80 iterations. But
the SHA1 state is really just 512 bits, and you can actually keep it in
a kind of "circular queue" of just 16 words instead.
This requires us to do the xor updates as we go along (rather than as a
pre-phase), but that's really what we want to do anyway.
This gets me really close to the OpenSSL performance on my Nehalem.
Look ma, all C code (ok, there's the rol/ror hack, but that one doesn't
strictly even matter on my Nehalem, it's just a local optimization).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This helps a teeny bit. But what I -really- want to do is to avoid the
whole 80-array loop, and do the xor updates as I go along..
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Bert Wesarg noticed non-x86 version of SHA_ROT() had a typo.
Also spell in-line assembly as __asm__(), otherwise I seem to get
error: implicit declaration of function 'asm' from my compiler.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Use the one with the smaller constant. It _can_ generate slightly
smaller code (a constant of 1 is special), but perhaps more importantly
it's possibly faster on any uarch that does a rotate with a loop.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Undo the change I picked up from the mailing list discussion suggested
by Nico, not because it is wrong, but it will be done at the end of the
follow-up series.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Based on the mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.
It requires an architecture that is ok with unaligned 32-bit loads and a
fast htonl().
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>