1
0
Fork 0
mirror of https://github.com/git/git.git synced 2024-11-16 22:14:53 +01:00
git/xdiff/xutils.c

483 lines
11 KiB
C
Raw Normal View History

Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
/*
* LibXDiff by Davide Libenzi ( File Differential Library )
* Copyright (C) 2003 Davide Libenzi
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with this library; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
* Davide Libenzi <davidel@xmailserver.org>
*
*/
xdiff: load full words in the inner loop of xdl_hash_record Redo the hashing loop in xdl_hash_record in a way that loads an entire 'long' at a time, using masking tricks to see when and where we found the terminating '\n'. I stole inspiration and code from the posts by Linus Torvalds around https://lkml.org/lkml/2012/3/2/452 https://lkml.org/lkml/2012/3/5/6 His method reads the buffers in sizeof(long) increments, and may thus overrun it by at most sizeof(long)-1 bytes before it sees the final newline (or hits the buffer length check). I considered padding out all buffers by a suitable amount to "catch" the overrun, but * this does not work for mmap()'d buffers: if you map 4096+8 bytes from a 4096 byte file, accessing the last 8 bytes results in a SIGBUS on my machine; and * it would also be extremely ugly because it intrudes deep into the unpacking machinery. So I adapted it to not read beyond the buffer at all. Instead, it reads the final partial word byte-by-byte and strings it together. Then it can use the same logic as before to finish the hashing. So far we enable this only on x86_64, where it provides nice speedup for diff-related work: Test origin/next tr/xdiff-fast-hash ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.07(0.05+0.02) 0.08(0.06+0.02) +14.3% 4000.2: log --raw -3000 (tree-only) 0.37(0.33+0.04) 0.37(0.32+0.04) +0.0% 4000.3: log -p -3000 (Myers) 1.75(1.65+0.09) 1.60(1.49+0.10) -8.6% 4000.4: log -p -3000 --histogram 1.73(1.62+0.09) 1.58(1.49+0.08) -8.7% 4000.5: log -p -3000 --patience 2.11(2.00+0.10) 1.94(1.80+0.11) -8.1% Perhaps other platforms could also benefit. However it does NOT work on big-endian systems! [jc: minimum style and compilation fixes] Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-06 23:01:23 +02:00
#include <limits.h>
#include <assert.h>
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
#include "xinclude.h"
long xdl_bogosqrt(long n) {
long i;
/*
* Classical integer square root approximation using shifts.
*/
for (i = 1; n > 0; n >>= 2)
i <<= 1;
return i;
}
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
int xdl_emit_diffrec(char const *rec, long size, char const *pre, long psize,
xdemitcb_t *ecb) {
int i = 2;
mmbuffer_t mb[3];
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
mb[0].ptr = (char *) pre;
mb[0].size = psize;
mb[1].ptr = (char *) rec;
mb[1].size = size;
if (size > 0 && rec[size - 1] != '\n') {
mb[2].ptr = (char *) "\n\\ No newline at end of file\n";
mb[2].size = strlen(mb[2].ptr);
i++;
}
if (ecb->outf(ecb->priv, mb, i) < 0) {
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
return -1;
}
return 0;
}
void *xdl_mmfile_first(mmfile_t *mmf, long *size)
{
*size = mmf->size;
return mmf->ptr;
}
long xdl_mmfile_size(mmfile_t *mmf)
{
return mmf->size;
}
int xdl_cha_init(chastore_t *cha, long isize, long icount) {
cha->head = cha->tail = NULL;
cha->isize = isize;
cha->nsize = icount * isize;
cha->ancur = cha->sncur = NULL;
cha->scurr = 0;
return 0;
}
void xdl_cha_free(chastore_t *cha) {
chanode_t *cur, *tmp;
for (cur = cha->head; (tmp = cur) != NULL;) {
cur = cur->next;
xdl_free(tmp);
}
}
void *xdl_cha_alloc(chastore_t *cha) {
chanode_t *ancur;
void *data;
if (!(ancur = cha->ancur) || ancur->icurr == cha->nsize) {
if (!(ancur = (chanode_t *) xdl_malloc(sizeof(chanode_t) + cha->nsize))) {
return NULL;
}
ancur->icurr = 0;
ancur->next = NULL;
if (cha->tail)
cha->tail->next = ancur;
if (!cha->head)
cha->head = ancur;
cha->tail = ancur;
cha->ancur = ancur;
}
data = (char *) ancur + sizeof(chanode_t) + ancur->icurr;
ancur->icurr += cha->isize;
return data;
}
long xdl_guess_lines(mmfile_t *mf, long sample) {
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
long nl = 0, size, tsize = 0;
char const *data, *cur, *top;
if ((cur = data = xdl_mmfile_first(mf, &size)) != NULL) {
for (top = data + size; nl < sample && cur < top; ) {
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
nl++;
if (!(cur = memchr(cur, '\n', top - cur)))
cur = top;
else
cur++;
}
tsize += (long) (cur - data);
}
if (nl && tsize)
nl = xdl_mmfile_size(mf) / (tsize / nl);
return nl + 1;
}
int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
{
int i1, i2;
if (s1 == s2 && !memcmp(l1, l2, s1))
return 1;
if (!(flags & XDF_WHITESPACE_FLAGS))
return 0;
i1 = 0;
i2 = 0;
/*
* -w matches everything that matches with -b, and -b in turn
* matches everything that matches with --ignore-space-at-eol.
*
* Each flavor of ignoring needs different logic to skip whitespaces
* while we have both sides to compare.
*/
if (flags & XDF_IGNORE_WHITESPACE) {
goto skip_ws;
while (i1 < s1 && i2 < s2) {
if (l1[i1++] != l2[i2++])
return 0;
skip_ws:
while (i1 < s1 && XDL_ISSPACE(l1[i1]))
i1++;
while (i2 < s2 && XDL_ISSPACE(l2[i2]))
i2++;
}
} else if (flags & XDF_IGNORE_WHITESPACE_CHANGE) {
while (i1 < s1 && i2 < s2) {
if (XDL_ISSPACE(l1[i1]) && XDL_ISSPACE(l2[i2])) {
/* Skip matching spaces and try again */
while (i1 < s1 && XDL_ISSPACE(l1[i1]))
i1++;
while (i2 < s2 && XDL_ISSPACE(l2[i2]))
i2++;
continue;
}
if (l1[i1++] != l2[i2++])
return 0;
}
} else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL) {
while (i1 < s1 && i2 < s2 && l1[i1++] == l2[i2++])
; /* keep going */
}
/*
* After running out of one side, the remaining side must have
* nothing but whitespace for the lines to match. Note that
* ignore-whitespace-at-eol case may break out of the loop
* while there still are characters remaining on both lines.
*/
if (i1 < s1) {
while (i1 < s1 && XDL_ISSPACE(l1[i1]))
i1++;
if (s1 != i1)
return 0;
}
if (i2 < s2) {
while (i2 < s2 && XDL_ISSPACE(l2[i2]))
i2++;
return (s2 == i2);
}
return 1;
}
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
static unsigned long xdl_hash_record_with_whitespace(char const **data,
char const *top, long flags) {
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
unsigned long ha = 5381;
char const *ptr = *data;
for (; ptr < top && *ptr != '\n'; ptr++) {
if (XDL_ISSPACE(*ptr)) {
const char *ptr2 = ptr;
int at_eol;
while (ptr + 1 < top && XDL_ISSPACE(ptr[1])
&& ptr[1] != '\n')
ptr++;
at_eol = (top <= ptr + 1 || ptr[1] == '\n');
if (flags & XDF_IGNORE_WHITESPACE)
; /* already handled */
else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
&& !at_eol) {
ha += (ha << 5);
ha ^= (unsigned long) ' ';
}
else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
&& !at_eol) {
while (ptr2 != ptr + 1) {
ha += (ha << 5);
ha ^= (unsigned long) *ptr2;
ptr2++;
}
}
continue;
}
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
ha += (ha << 5);
ha ^= (unsigned long) *ptr;
}
*data = ptr < top ? ptr + 1: ptr;
return ha;
}
xdiff: load full words in the inner loop of xdl_hash_record Redo the hashing loop in xdl_hash_record in a way that loads an entire 'long' at a time, using masking tricks to see when and where we found the terminating '\n'. I stole inspiration and code from the posts by Linus Torvalds around https://lkml.org/lkml/2012/3/2/452 https://lkml.org/lkml/2012/3/5/6 His method reads the buffers in sizeof(long) increments, and may thus overrun it by at most sizeof(long)-1 bytes before it sees the final newline (or hits the buffer length check). I considered padding out all buffers by a suitable amount to "catch" the overrun, but * this does not work for mmap()'d buffers: if you map 4096+8 bytes from a 4096 byte file, accessing the last 8 bytes results in a SIGBUS on my machine; and * it would also be extremely ugly because it intrudes deep into the unpacking machinery. So I adapted it to not read beyond the buffer at all. Instead, it reads the final partial word byte-by-byte and strings it together. Then it can use the same logic as before to finish the hashing. So far we enable this only on x86_64, where it provides nice speedup for diff-related work: Test origin/next tr/xdiff-fast-hash ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.07(0.05+0.02) 0.08(0.06+0.02) +14.3% 4000.2: log --raw -3000 (tree-only) 0.37(0.33+0.04) 0.37(0.32+0.04) +0.0% 4000.3: log -p -3000 (Myers) 1.75(1.65+0.09) 1.60(1.49+0.10) -8.6% 4000.4: log -p -3000 --histogram 1.73(1.62+0.09) 1.58(1.49+0.08) -8.7% 4000.5: log -p -3000 --patience 2.11(2.00+0.10) 1.94(1.80+0.11) -8.1% Perhaps other platforms could also benefit. However it does NOT work on big-endian systems! [jc: minimum style and compilation fixes] Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-06 23:01:23 +02:00
#ifdef XDL_FAST_HASH
#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))
#define ONEBYTES REPEAT_BYTE(0x01)
#define NEWLINEBYTES REPEAT_BYTE(0x0a)
#define HIGHBITS REPEAT_BYTE(0x80)
xdiff: load full words in the inner loop of xdl_hash_record Redo the hashing loop in xdl_hash_record in a way that loads an entire 'long' at a time, using masking tricks to see when and where we found the terminating '\n'. I stole inspiration and code from the posts by Linus Torvalds around https://lkml.org/lkml/2012/3/2/452 https://lkml.org/lkml/2012/3/5/6 His method reads the buffers in sizeof(long) increments, and may thus overrun it by at most sizeof(long)-1 bytes before it sees the final newline (or hits the buffer length check). I considered padding out all buffers by a suitable amount to "catch" the overrun, but * this does not work for mmap()'d buffers: if you map 4096+8 bytes from a 4096 byte file, accessing the last 8 bytes results in a SIGBUS on my machine; and * it would also be extremely ugly because it intrudes deep into the unpacking machinery. So I adapted it to not read beyond the buffer at all. Instead, it reads the final partial word byte-by-byte and strings it together. Then it can use the same logic as before to finish the hashing. So far we enable this only on x86_64, where it provides nice speedup for diff-related work: Test origin/next tr/xdiff-fast-hash ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.07(0.05+0.02) 0.08(0.06+0.02) +14.3% 4000.2: log --raw -3000 (tree-only) 0.37(0.33+0.04) 0.37(0.32+0.04) +0.0% 4000.3: log -p -3000 (Myers) 1.75(1.65+0.09) 1.60(1.49+0.10) -8.6% 4000.4: log -p -3000 --histogram 1.73(1.62+0.09) 1.58(1.49+0.08) -8.7% 4000.5: log -p -3000 --patience 2.11(2.00+0.10) 1.94(1.80+0.11) -8.1% Perhaps other platforms could also benefit. However it does NOT work on big-endian systems! [jc: minimum style and compilation fixes] Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-06 23:01:23 +02:00
/* Return the high bit set in the first byte that is a zero */
static inline unsigned long has_zero(unsigned long a)
{
return ((a - ONEBYTES) & ~a) & HIGHBITS;
}
static inline long count_masked_bytes(unsigned long mask)
{
xdiff: choose XDL_FAST_HASH code on sizeof(long) instead of __WORDSIZE Darwin does not define __WORDSIZE, and compiles the 32-bit code path on 64-bit systems, resulting in a totally broken git. I could not find an alternative -- other than the platform symbols (__x86_64__ etc.) -- that does the test in the preprocessor. However, we can also just test for the size of a 'long', which is what really matters here. Any compiler worth its salt will leave only the branch relevant for its platform, and indeed on Linux/GCC the numbers don't change: Test tr/darwin-xdl-fast-hash origin/next origin/master ------------------------------------------------------------------------------------------------------------------ 4000.1: log -3000 (baseline) 0.09(0.07+0.01) 0.09(0.07+0.01) -5.5%* 0.09(0.07+0.01) -4.1% 4000.2: log --raw -3000 (tree-only) 0.47(0.41+0.05) 0.47(0.40+0.05) -0.5% 0.45(0.38+0.06) -3.5%. 4000.3: log -p -3000 (Myers) 1.81(1.67+0.12) 1.81(1.67+0.13) +0.3% 1.99(1.84+0.12) +10.2%*** 4000.4: log -p -3000 --histogram 1.79(1.66+0.11) 1.80(1.67+0.11) +0.4% 1.96(1.82+0.10) +9.2%*** 4000.5: log -p -3000 --patience 2.17(2.02+0.13) 2.20(2.04+0.13) +1.3%. 2.33(2.18+0.13) +7.4%*** ------------------------------------------------------------------------------------------------------------------ Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com> Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-01 12:23:20 +02:00
if (sizeof(long) == 8) {
/*
* Jan Achrenius on G+: microoptimized version of
* the simpler "(mask & ONEBYTES) * ONEBYTES >> 56"
* that works for the bytemasks without having to
* mask them first.
*/
/*
* return mask * 0x0001020304050608 >> 56;
xdiff: choose XDL_FAST_HASH code on sizeof(long) instead of __WORDSIZE Darwin does not define __WORDSIZE, and compiles the 32-bit code path on 64-bit systems, resulting in a totally broken git. I could not find an alternative -- other than the platform symbols (__x86_64__ etc.) -- that does the test in the preprocessor. However, we can also just test for the size of a 'long', which is what really matters here. Any compiler worth its salt will leave only the branch relevant for its platform, and indeed on Linux/GCC the numbers don't change: Test tr/darwin-xdl-fast-hash origin/next origin/master ------------------------------------------------------------------------------------------------------------------ 4000.1: log -3000 (baseline) 0.09(0.07+0.01) 0.09(0.07+0.01) -5.5%* 0.09(0.07+0.01) -4.1% 4000.2: log --raw -3000 (tree-only) 0.47(0.41+0.05) 0.47(0.40+0.05) -0.5% 0.45(0.38+0.06) -3.5%. 4000.3: log -p -3000 (Myers) 1.81(1.67+0.12) 1.81(1.67+0.13) +0.3% 1.99(1.84+0.12) +10.2%*** 4000.4: log -p -3000 --histogram 1.79(1.66+0.11) 1.80(1.67+0.11) +0.4% 1.96(1.82+0.10) +9.2%*** 4000.5: log -p -3000 --patience 2.17(2.02+0.13) 2.20(2.04+0.13) +1.3%. 2.33(2.18+0.13) +7.4%*** ------------------------------------------------------------------------------------------------------------------ Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com> Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-01 12:23:20 +02:00
*
* Doing it like this avoids warnings on 32-bit machines.
xdiff: choose XDL_FAST_HASH code on sizeof(long) instead of __WORDSIZE Darwin does not define __WORDSIZE, and compiles the 32-bit code path on 64-bit systems, resulting in a totally broken git. I could not find an alternative -- other than the platform symbols (__x86_64__ etc.) -- that does the test in the preprocessor. However, we can also just test for the size of a 'long', which is what really matters here. Any compiler worth its salt will leave only the branch relevant for its platform, and indeed on Linux/GCC the numbers don't change: Test tr/darwin-xdl-fast-hash origin/next origin/master ------------------------------------------------------------------------------------------------------------------ 4000.1: log -3000 (baseline) 0.09(0.07+0.01) 0.09(0.07+0.01) -5.5%* 0.09(0.07+0.01) -4.1% 4000.2: log --raw -3000 (tree-only) 0.47(0.41+0.05) 0.47(0.40+0.05) -0.5% 0.45(0.38+0.06) -3.5%. 4000.3: log -p -3000 (Myers) 1.81(1.67+0.12) 1.81(1.67+0.13) +0.3% 1.99(1.84+0.12) +10.2%*** 4000.4: log -p -3000 --histogram 1.79(1.66+0.11) 1.80(1.67+0.11) +0.4% 1.96(1.82+0.10) +9.2%*** 4000.5: log -p -3000 --patience 2.17(2.02+0.13) 2.20(2.04+0.13) +1.3%. 2.33(2.18+0.13) +7.4%*** ------------------------------------------------------------------------------------------------------------------ Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com> Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-01 12:23:20 +02:00
*/
long a = (REPEAT_BYTE(0x01) / 0xff + 1);
return mask * a >> (sizeof(long) * 7);
xdiff: choose XDL_FAST_HASH code on sizeof(long) instead of __WORDSIZE Darwin does not define __WORDSIZE, and compiles the 32-bit code path on 64-bit systems, resulting in a totally broken git. I could not find an alternative -- other than the platform symbols (__x86_64__ etc.) -- that does the test in the preprocessor. However, we can also just test for the size of a 'long', which is what really matters here. Any compiler worth its salt will leave only the branch relevant for its platform, and indeed on Linux/GCC the numbers don't change: Test tr/darwin-xdl-fast-hash origin/next origin/master ------------------------------------------------------------------------------------------------------------------ 4000.1: log -3000 (baseline) 0.09(0.07+0.01) 0.09(0.07+0.01) -5.5%* 0.09(0.07+0.01) -4.1% 4000.2: log --raw -3000 (tree-only) 0.47(0.41+0.05) 0.47(0.40+0.05) -0.5% 0.45(0.38+0.06) -3.5%. 4000.3: log -p -3000 (Myers) 1.81(1.67+0.12) 1.81(1.67+0.13) +0.3% 1.99(1.84+0.12) +10.2%*** 4000.4: log -p -3000 --histogram 1.79(1.66+0.11) 1.80(1.67+0.11) +0.4% 1.96(1.82+0.10) +9.2%*** 4000.5: log -p -3000 --patience 2.17(2.02+0.13) 2.20(2.04+0.13) +1.3%. 2.33(2.18+0.13) +7.4%*** ------------------------------------------------------------------------------------------------------------------ Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com> Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-01 12:23:20 +02:00
} else {
/* Carl Chatfield / Jan Achrenius G+ version for 32-bit */
/* (000000 0000ff 00ffff ffffff) -> ( 1 1 2 3 ) */
long a = (0x0ff0001 + mask) >> 23;
/* Fix the 1 for 00 case */
return a & mask;
xdiff: choose XDL_FAST_HASH code on sizeof(long) instead of __WORDSIZE Darwin does not define __WORDSIZE, and compiles the 32-bit code path on 64-bit systems, resulting in a totally broken git. I could not find an alternative -- other than the platform symbols (__x86_64__ etc.) -- that does the test in the preprocessor. However, we can also just test for the size of a 'long', which is what really matters here. Any compiler worth its salt will leave only the branch relevant for its platform, and indeed on Linux/GCC the numbers don't change: Test tr/darwin-xdl-fast-hash origin/next origin/master ------------------------------------------------------------------------------------------------------------------ 4000.1: log -3000 (baseline) 0.09(0.07+0.01) 0.09(0.07+0.01) -5.5%* 0.09(0.07+0.01) -4.1% 4000.2: log --raw -3000 (tree-only) 0.47(0.41+0.05) 0.47(0.40+0.05) -0.5% 0.45(0.38+0.06) -3.5%. 4000.3: log -p -3000 (Myers) 1.81(1.67+0.12) 1.81(1.67+0.13) +0.3% 1.99(1.84+0.12) +10.2%*** 4000.4: log -p -3000 --histogram 1.79(1.66+0.11) 1.80(1.67+0.11) +0.4% 1.96(1.82+0.10) +9.2%*** 4000.5: log -p -3000 --patience 2.17(2.02+0.13) 2.20(2.04+0.13) +1.3%. 2.33(2.18+0.13) +7.4%*** ------------------------------------------------------------------------------------------------------------------ Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com> Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-01 12:23:20 +02:00
}
xdiff: load full words in the inner loop of xdl_hash_record Redo the hashing loop in xdl_hash_record in a way that loads an entire 'long' at a time, using masking tricks to see when and where we found the terminating '\n'. I stole inspiration and code from the posts by Linus Torvalds around https://lkml.org/lkml/2012/3/2/452 https://lkml.org/lkml/2012/3/5/6 His method reads the buffers in sizeof(long) increments, and may thus overrun it by at most sizeof(long)-1 bytes before it sees the final newline (or hits the buffer length check). I considered padding out all buffers by a suitable amount to "catch" the overrun, but * this does not work for mmap()'d buffers: if you map 4096+8 bytes from a 4096 byte file, accessing the last 8 bytes results in a SIGBUS on my machine; and * it would also be extremely ugly because it intrudes deep into the unpacking machinery. So I adapted it to not read beyond the buffer at all. Instead, it reads the final partial word byte-by-byte and strings it together. Then it can use the same logic as before to finish the hashing. So far we enable this only on x86_64, where it provides nice speedup for diff-related work: Test origin/next tr/xdiff-fast-hash ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.07(0.05+0.02) 0.08(0.06+0.02) +14.3% 4000.2: log --raw -3000 (tree-only) 0.37(0.33+0.04) 0.37(0.32+0.04) +0.0% 4000.3: log -p -3000 (Myers) 1.75(1.65+0.09) 1.60(1.49+0.10) -8.6% 4000.4: log -p -3000 --histogram 1.73(1.62+0.09) 1.58(1.49+0.08) -8.7% 4000.5: log -p -3000 --patience 2.11(2.00+0.10) 1.94(1.80+0.11) -8.1% Perhaps other platforms could also benefit. However it does NOT work on big-endian systems! [jc: minimum style and compilation fixes] Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-06 23:01:23 +02:00
}
unsigned long xdl_hash_record(char const **data, char const *top, long flags)
{
unsigned long hash = 5381;
unsigned long a = 0, mask = 0;
char const *ptr = *data;
char const *end = top - sizeof(unsigned long) + 1;
if (flags & XDF_WHITESPACE_FLAGS)
return xdl_hash_record_with_whitespace(data, top, flags);
ptr -= sizeof(unsigned long);
do {
hash += hash << 5;
hash ^= a;
ptr += sizeof(unsigned long);
if (ptr >= end)
break;
a = *(unsigned long *)ptr;
/* Do we have any '\n' bytes in this word? */
mask = has_zero(a ^ NEWLINEBYTES);
} while (!mask);
if (ptr >= end) {
/*
* There is only a partial word left at the end of the
* buffer. Because we may work with a memory mapping,
* we have to grab the rest byte by byte instead of
* blindly reading it.
*
* To avoid problems with masking in a signed value,
* we use an unsigned char here.
*/
const char *p;
for (p = top - 1; p >= ptr; p--)
a = (a << 8) + *((const unsigned char *)p);
mask = has_zero(a ^ NEWLINEBYTES);
if (!mask)
/*
* No '\n' found in the partial word. Make a
* mask that matches what we read.
*/
mask = 1UL << (8 * (top - ptr) + 7);
}
/* The mask *below* the first high bit set */
mask = (mask - 1) & ~mask;
mask >>= 7;
hash += hash << 5;
hash ^= a & mask;
/* Advance past the last (possibly partial) word */
ptr += count_masked_bytes(mask);
if (ptr < top) {
assert(*ptr == '\n');
ptr++;
}
*data = ptr;
return hash;
}
#else /* XDL_FAST_HASH */
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
unsigned long ha = 5381;
char const *ptr = *data;
if (flags & XDF_WHITESPACE_FLAGS)
return xdl_hash_record_with_whitespace(data, top, flags);
for (; ptr < top && *ptr != '\n'; ptr++) {
ha += (ha << 5);
ha ^= (unsigned long) *ptr;
}
*data = ptr < top ? ptr + 1: ptr;
return ha;
}
xdiff: load full words in the inner loop of xdl_hash_record Redo the hashing loop in xdl_hash_record in a way that loads an entire 'long' at a time, using masking tricks to see when and where we found the terminating '\n'. I stole inspiration and code from the posts by Linus Torvalds around https://lkml.org/lkml/2012/3/2/452 https://lkml.org/lkml/2012/3/5/6 His method reads the buffers in sizeof(long) increments, and may thus overrun it by at most sizeof(long)-1 bytes before it sees the final newline (or hits the buffer length check). I considered padding out all buffers by a suitable amount to "catch" the overrun, but * this does not work for mmap()'d buffers: if you map 4096+8 bytes from a 4096 byte file, accessing the last 8 bytes results in a SIGBUS on my machine; and * it would also be extremely ugly because it intrudes deep into the unpacking machinery. So I adapted it to not read beyond the buffer at all. Instead, it reads the final partial word byte-by-byte and strings it together. Then it can use the same logic as before to finish the hashing. So far we enable this only on x86_64, where it provides nice speedup for diff-related work: Test origin/next tr/xdiff-fast-hash ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.07(0.05+0.02) 0.08(0.06+0.02) +14.3% 4000.2: log --raw -3000 (tree-only) 0.37(0.33+0.04) 0.37(0.32+0.04) +0.0% 4000.3: log -p -3000 (Myers) 1.75(1.65+0.09) 1.60(1.49+0.10) -8.6% 4000.4: log -p -3000 --histogram 1.73(1.62+0.09) 1.58(1.49+0.08) -8.7% 4000.5: log -p -3000 --patience 2.11(2.00+0.10) 1.94(1.80+0.11) -8.1% Perhaps other platforms could also benefit. However it does NOT work on big-endian systems! [jc: minimum style and compilation fixes] Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-06 23:01:23 +02:00
#endif /* XDL_FAST_HASH */
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
unsigned int xdl_hashbits(unsigned int size) {
unsigned int val = 1, bits = 0;
for (; val < size && bits < CHAR_BIT * sizeof(unsigned int); val <<= 1, bits++);
return bits ? bits: 1;
}
int xdl_num_out(char *out, long val) {
char *ptr, *str = out;
char buf[32];
ptr = buf + sizeof(buf) - 1;
*ptr = '\0';
if (val < 0) {
*--ptr = '-';
val = -val;
}
for (; val && ptr > buf; val /= 10)
*--ptr = "0123456789"[val % 10];
if (*ptr)
for (; *ptr; ptr++, str++)
*str = *ptr;
else
*str++ = '0';
*str = '\0';
return str - out;
}
int xdl_emit_hunk_hdr(long s1, long c1, long s2, long c2,
const char *func, long funclen, xdemitcb_t *ecb) {
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
int nb = 0;
mmbuffer_t mb;
char buf[128];
memcpy(buf, "@@ -", 4);
nb += 4;
nb += xdl_num_out(buf + nb, c1 ? s1: s1 - 1);
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
if (c1 != 1) {
memcpy(buf + nb, ",", 1);
nb += 1;
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
nb += xdl_num_out(buf + nb, c1);
}
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
memcpy(buf + nb, " +", 2);
nb += 2;
nb += xdl_num_out(buf + nb, c2 ? s2: s2 - 1);
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
if (c2 != 1) {
memcpy(buf + nb, ",", 1);
nb += 1;
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
nb += xdl_num_out(buf + nb, c2);
}
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
memcpy(buf + nb, " @@", 3);
nb += 3;
if (func && funclen) {
buf[nb++] = ' ';
if (funclen > sizeof(buf) - nb - 1)
funclen = sizeof(buf) - nb - 1;
memcpy(buf + nb, func, funclen);
nb += funclen;
}
buf[nb++] = '\n';
Use a *real* built-in diff generator This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". This has several huge advantages, for example: Before: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m24.818s user 0m13.332s sys 0m8.664s After: [torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null real 0m4.563s user 0m2.944s sys 0m1.580s and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows). Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc). NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid. But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs. Now, in the interest of full disclosure, I should also point out a few downsides: - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff. - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that. - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it. That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue. Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple. That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable. Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do mmfile_t mf; buf = read_sha1_file(sha1, type, &size); mf->ptr = buf; mf->size = size; .. use "mf" directly .. which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces). [ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ] Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
mb.ptr = buf;
mb.size = nb;
if (ecb->outf(ecb->priv, &mb, 1) < 0)
return -1;
return 0;
}
int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
int line1, int count1, int line2, int count2)
{
/*
* This probably does not work outside Git, since
* we have a very simple mmfile structure.
*
* Note: ideally, we would reuse the prepared environment, but
* the libxdiff interface does not (yet) allow for diffing only
* ranges of lines instead of the whole files.
*/
mmfile_t subfile1, subfile2;
xdfenv_t env;
subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1]->ptr;
subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2]->ptr +
diff_env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;
subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1]->ptr;
subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2]->ptr +
diff_env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;
if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
return -1;
memcpy(diff_env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);
memcpy(diff_env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);
xdl_free_env(&env);
return 0;
}