
mirrors/git

mirror of https://github.com/git/git.git synced 2024-11-18 06:54:55 +01:00

483 lines

11 KiB

C


Use a real built-in diff generator

This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff".

This has several huge advantages, for example:

Before:

	[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null
	real 0m24.818s
	user 0m13.332s
	sys 0m8.664s

After:

	[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null
	real 0m4.563s
	user 0m2.944s
	sys 0m1.580s

and the fact that this should be a lot more portable (ie we can ignore all the issues with doing fork/execve under Windows).

Perhaps even more importantly, this allows us to do diffs without actually ever writing out the git file contents to a temporary file (and without any of the shell quoting issues on filenames etc etc).

NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the current "diff-core" code actually will always write the temp-files, because it used to be something that you simply had to do. So this current one actually writes a temp-file like before, and then reads it into memory again just to do the diff. Stupid.

But if this basic infrastructure is accepted, we can start switching over diff-core to not write temp-files, which should speed things up even further, especially when doing big tree-to-tree diffs.

Now, in the interest of full disclosure, I should also point out a few downsides:

 - the libxdiff algorithm is different, and I bet GNU diff has gotten a lot more testing. And the thing is, generating a diff is not an exact science - you can get two different diffs (and you will), and they can both be perfectly valid. So it's not possible to "validate" the libxdiff output by just comparing it against GNU diff.

 - GNU diff does some nice eye-candy, like trying to figure out what the last function was, and adding that information to the "@@ .." line. libxdiff doesn't do that.

 - The libxdiff thing has some known deficiencies. In particular, it gets the "\No newline at end of file" case wrong. So this is currently for the experimental branch only. I hope Davide will help fix it.

That said, I think the huge performance advantage, and the fact that it integrates better is definitely worth it. But it should go into a development branch at least due to the missing newline issue.

Technical note: this is based on libxdiff-0.17, but I did some surgery to get rid of the extraneous fat - stuff that git doesn't need, and seriously cutting down on mmfile_t, which had much more capabilities than the diff algorithm either needed or used. In this version, "mmfile_t" is just a trivial <pointer,length> tuple.

That said, I tried to keep the differences to simple removals, so that you can do a diff between this and the libxdiff origin, and you'll basically see just things getting deleted. Even the mmfile_t simplifications are left in a state where the diffs should be readable.

Apologies to Davide, whom I'd love to get feedback on this all from (I wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old complex format had a helper function for that, but I did my surgery with the goal in mind that eventually we _should_ just do

	mmfile_t mf;

	buf = read_sha1_file(sha1, type, &size);
	mf->ptr = buf;
	mf->size = size;
	.. use "mf" directly ..

which was really a nightmare with the old "helpful" mmfile_t, and really is that easy with the new cut-down interfaces).

[ Btw, as any hawk-eye can see from the diff, this was actually generated with itself, so it is "self-hosting". That's about all the testing it has gotten, along with the above kernel diff, which eye-balls correctly, but shows the newline issue when you double-check it with "git-apply" ]

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 05:13:22 +01:00
			`/*`
			`* LibXDiff by Davide Libenzi ( File Differential Library )`
			`* Copyright (C) 2003 Davide Libenzi`
			`*`
			`* This library is free software; you can redistribute it and/or`
			`* modify it under the terms of the GNU Lesser General Public`
			`* License as published by the Free Software Foundation; either`
			`* version 2.1 of the License, or (at your option) any later version.`
			`*`
			`* This library is distributed in the hope that it will be useful,`
			`* but WITHOUT ANY WARRANTY; without even the implied warranty of`
			`* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU`
			`* Lesser General Public License for more details.`
			`*`
			`* You should have received a copy of the GNU Lesser General Public`
			`* License along with this library; if not, write to the Free Software`
			`* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA`
			`*`
			`* Davide Libenzi <davidel@xmailserver.org>`
			`*`
			`*/`

xdiff: load full words in the inner loop of xdl_hash_record

Redo the hashing loop in xdl_hash_record in a way that loads an entire 'long' at a time, using masking tricks to see when and where we found the terminating '\n'.

I stole inspiration and code from the posts by Linus Torvalds around

	https://lkml.org/lkml/2012/3/2/452
	https://lkml.org/lkml/2012/3/5/6

His method reads the buffers in sizeof(long) increments, and may thus overrun it by at most sizeof(long)-1 bytes before it sees the final newline (or hits the buffer length check). I considered padding out all buffers by a suitable amount to "catch" the overrun, but

* this does not work for mmap()'d buffers: if you map 4096+8 bytes from a 4096 byte file, accessing the last 8 bytes results in a SIGBUS on my machine; and

* it would also be extremely ugly because it intrudes deep into the unpacking machinery.

So I adapted it to not read beyond the buffer at all. Instead, it reads the final partial word byte-by-byte and strings it together. Then it can use the same logic as before to finish the hashing.

So far we enable this only on x86_64, where it provides nice speedup for diff-related work:

	Test                                  origin/next      tr/xdiff-fast-hash
	-----------------------------------------------------------------------------
	4000.1: log -3000 (baseline)          0.07(0.05+0.02)  0.08(0.06+0.02) +14.3%
	4000.2: log --raw -3000 (tree-only)   0.37(0.33+0.04)  0.37(0.32+0.04) +0.0%
	4000.3: log -p -3000 (Myers)          1.75(1.65+0.09)  1.60(1.49+0.10) -8.6%
	4000.4: log -p -3000 --histogram      1.73(1.62+0.09)  1.58(1.49+0.08) -8.7%
	4000.5: log -p -3000 --patience       2.11(2.00+0.10)  1.94(1.80+0.11) -8.1%

Perhaps other platforms could also benefit. However it does NOT work on big-endian systems!

[jc: minimum style and compilation fixes]

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-06 23:01:23 +02:00
			`#include <limits.h>`
			`#include <assert.h>`
			`#include "xinclude.h"`
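The xdl_hash_record commit message above mentions loading a whole word at a time and using masking tricks to spot the terminating '\n', while never reading past the end of the buffer. The following is only a rough sketch of that detection idea, not the code from that commit: it does no hashing, uses a fixed 64-bit word, and every `sketch_` / `SKETCH_` name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ONES  UINT64_C(0x0101010101010101)
#define SKETCH_HIGHS UINT64_C(0x8080808080808080)

/* Return nonzero if any of the 8 bytes starting at p equals '\n'. */
static int sketch_word_has_newline(const char *p)
{
	uint64_t w, x;

	memcpy(&w, p, sizeof(w));       /* alignment-safe load of one word */
	x = w ^ (SKETCH_ONES * '\n');   /* bytes equal to '\n' become zero */
	/* Classic "does this word contain a zero byte" test. */
	return ((x - SKETCH_ONES) & ~x & SKETCH_HIGHS) != 0;
}

/*
 * Length of the first line in buf[0..n), including its '\n' if present.
 * Whole words are skipped while at least 8 bytes remain and none of them
 * is a newline; the remainder is scanned byte by byte, so nothing beyond
 * buf + n is ever read.
 */
static size_t sketch_line_len(const char *buf, size_t n)
{
	size_t i = 0;

	assert(buf != NULL || n == 0);
	while (n - i >= sizeof(uint64_t) && !sketch_word_has_newline(buf + i))
		i += sizeof(uint64_t);
	while (i < n && buf[i] != '\n')
		i++;
	return i < n ? i + 1 : n;
}
```

This sketch only asks whether a word contains a newline, which does not depend on byte order. Locating the newline inside the word does depend on byte order, which is presumably why the commit message warns about big-endian systems.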




Clean-up trivially redundant diff. Also corrects the line numbers in unified output when using zero lines context. 2006-04-04 03:47:55 +02:00			`long xdl_bogosqrt(long n) {`
			`long i;`

			`/*`
			`* Classical integer square root approximation using shifts.`
			`*/`
			`for (i = 1; n > 0; n >>= 2)`
			`i <<= 1;`

			`return i;`
			`}`
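To see how rough this approximation is, the loop can be copied into a standalone helper and run on a few inputs. This is a sketch for illustration only; `sketch_bogosqrt` and `sketch_bogosqrt_demo` are hypothetical names, and the loop body is the one shown above.

```c
#include <assert.h>
#include <stdio.h>

/* Same loop as xdl_bogosqrt() above, under a hypothetical name. */
static long sketch_bogosqrt(long n)
{
	long i;

	/* Double i once for every two bits of n that are shifted out. */
	for (i = 1; n > 0; n >>= 2)
		i <<= 1;

	return i;
}

/*
 * Print a few estimates. Traced by hand:
 *   0 -> 1, 1 -> 2, 4 -> 4, 16 -> 8, 100 -> 16, 1000000 -> 1024
 */
static void sketch_bogosqrt_demo(void)
{
	static const long samples[] = { 0, 1, 4, 16, 100, 1000000 };
	size_t k;

	for (k = 0; k < sizeof(samples) / sizeof(samples[0]); k++) {
		long r = sketch_bogosqrt(samples[k]);

		/* The result is a power of two whose square exceeds the input. */
		assert(r * r > samples[k]);
		printf("%ld -> %ld\n", samples[k], r);
	}
}
```

For positive inputs the result is always a power of two that lies above the true square root but no more than twice as large, which is enough for a heuristic bound.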


			`int xdl_emit_diffrec(char const *rec, long size, char const *pre, long psize,`
			`xdemitcb_t *ecb) {`
xdiff: minor changes to match libxdiff-0.21 This reformats the change 621c53cc082299eaf69e9f2dc0274547c7d87fb0 introduced to match what upstream author implemented in libxdiff-0.21 without changing any logic (hopefully ;-). This is to help keep us in sync with the upstream. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-06-20 02:01:35 +02:00			`int i = 2;`
builtin-diff: \No newline at end of file. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-25 16:56:03 +01:00			`mmbuffer_t mb[3];`
			`mb[0].ptr = (char *) pre;`
			`mb[0].size = psize;`
			`mb[1].ptr = (char *) rec;`
			`mb[1].size = size;`
			`if (size > 0 && rec[size - 1] != '\n') {`
			`mb[2].ptr = (char *) "\n\\ No newline at end of file\n";`
			`mb[2].size = strlen(mb[2].ptr);`
			`i++;`
			`}`
			`if (ecb->outf(ecb->priv, mb, i) < 0) {`
			`return -1;`
			`}`

			`return 0;`
			`}`
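The function above hands the prefix, the record and an optional "No newline" marker to the caller's output callback as an array of buffers. A minimal sketch of how such a callback might consume them follows. The `sketch_` types are simplified stand-ins written for this example, not the real definitions from xdiff.h, and the callback signature is an assumption modelled on the `ecb->outf(ecb->priv, mb, i)` call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for mmbuffer_t and xdemitcb_t. */
typedef struct {
	char *ptr;
	long size;
} sketch_mmbuffer_t;

typedef struct {
	void *priv;
	int (*outf)(void *priv, sketch_mmbuffer_t *mb, int nbuf);
} sketch_xdemitcb_t;

/* Callback state: a small fixed buffer that collects the output. */
struct sketch_sink {
	char buf[256];
	size_t len;
};

/* Append every buffer to the sink; fail with -1 if it would overflow. */
static int sketch_collect(void *priv, sketch_mmbuffer_t *mb, int nbuf)
{
	struct sketch_sink *sink = priv;
	int i;

	for (i = 0; i < nbuf; i++) {
		size_t n = (size_t) mb[i].size;

		if (sink->len + n >= sizeof(sink->buf))
			return -1;
		memcpy(sink->buf + sink->len, mb[i].ptr, n);
		sink->len += n;
	}
	sink->buf[sink->len] = '\0';
	return 0;
}

/* Same buffer layout as xdl_emit_diffrec(): prefix, record, optional marker. */
static int sketch_emit(char const *rec, long size, char const *pre, long psize,
		       sketch_xdemitcb_t *ecb)
{
	int i = 2;
	sketch_mmbuffer_t mb[3];

	assert(ecb != NULL && ecb->outf != NULL);
	mb[0].ptr = (char *) pre;
	mb[0].size = psize;
	mb[1].ptr = (char *) rec;
	mb[1].size = size;
	if (size > 0 && rec[size - 1] != '\n') {
		mb[2].ptr = (char *) "\n\\ No newline at end of file\n";
		mb[2].size = (long) strlen(mb[2].ptr);
		i++;
	}
	return ecb->outf(ecb->priv, mb, i) < 0 ? -1 : 0;
}
```

Passing an array of buffers lets the emitter avoid copying the record just to glue a one-character prefix in front of it; the callback decides whether to concatenate, write, or inspect the pieces.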

			`void *xdl_mmfile_first(mmfile_t *mmf, long *size)`
			`{`
			`*size = mmf->size;`
			`return mmf->ptr;`
			`}`


			`long xdl_mmfile_size(mmfile_t *mmf)`
			`{`
			`return mmf->size;`
			`}`


			`int xdl_cha_init(chastore_t *cha, long isize, long icount) {`

			`cha->head = cha->tail = NULL;`
			`cha->isize = isize;`
			`cha->nsize = icount * isize;`
			`cha->ancur = cha->sncur = NULL;`
			`cha->scurr = 0;`

			`return 0;`
			`}`


			`void xdl_cha_free(chastore_t *cha) {`
			`chanode_t *cur, *tmp;`

			`for (cur = cha->head; (tmp = cur) != NULL;) {`
			`cur = cur->next;`
			`xdl_free(tmp);`
			`}`
			`}`


			`void *xdl_cha_alloc(chastore_t *cha) {`
			`chanode_t *ancur;`
			`void *data;`

			`if (!(ancur = cha->ancur) || ancur->icurr == cha->nsize) {`
			`if (!(ancur = (chanode_t *) xdl_malloc(sizeof(chanode_t) + cha->nsize))) {`

			`return NULL;`
			`}`
			`ancur->icurr = 0;`
			`ancur->next = NULL;`
			`if (cha->tail)`
			`cha->tail->next = ancur;`
			`if (!cha->head)`
			`cha->head = ancur;`
			`cha->tail = ancur;`
			`cha->ancur = ancur;`
			`}`

			`data = (char *) ancur + sizeof(chanode_t) + ancur->icurr;`
			`ancur->icurr += cha->isize;`

			`return data;`
			`}`
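Taken together, xdl_cha_init(), xdl_cha_alloc() and xdl_cha_free() form a small chunked allocator: fixed-size items are carved out of larger nodes, and everything is released in one pass. The sketch below reproduces that pattern with plain malloc/free so it can be compiled on its own. The `sketch_` types are hypothetical and carry only the fields the three functions above actually touch; the real chastore_t has more.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for chanode_t and chastore_t. */
typedef struct sketch_chanode {
	struct sketch_chanode *next;
	long icurr;                 /* bytes already handed out from this node */
} sketch_chanode_t;

typedef struct {
	sketch_chanode_t *head, *tail;
	long isize, nsize;          /* item size, payload bytes per node */
	sketch_chanode_t *ancur;    /* node currently being filled */
} sketch_chastore_t;

static void sketch_cha_init(sketch_chastore_t *cha, long isize, long icount)
{
	assert(isize > 0 && icount > 0);
	cha->head = cha->tail = NULL;
	cha->isize = isize;
	cha->nsize = icount * isize;
	cha->ancur = NULL;
}

/* Hand out one item, starting a new node when the current one is full. */
static void *sketch_cha_alloc(sketch_chastore_t *cha)
{
	sketch_chanode_t *ancur = cha->ancur;
	void *data;

	if (!ancur || ancur->icurr == cha->nsize) {
		ancur = malloc(sizeof(*ancur) + (size_t) cha->nsize);
		if (!ancur)
			return NULL;
		ancur->icurr = 0;
		ancur->next = NULL;
		if (cha->tail)
			cha->tail->next = ancur;
		if (!cha->head)
			cha->head = ancur;
		cha->tail = ancur;
		cha->ancur = ancur;
	}

	/* The payload starts right after the node header. */
	data = (char *) ancur + sizeof(*ancur) + ancur->icurr;
	ancur->icurr += cha->isize;

	return data;
}

/* Release every node; unlike the original, also reset the pointers. */
static void sketch_cha_free(sketch_chastore_t *cha)
{
	sketch_chanode_t *cur, *tmp;

	for (cur = cha->head; (tmp = cur) != NULL;) {
		cur = cur->next;
		free(tmp);
	}
	cha->head = cha->tail = cha->ancur = NULL;
}
```

Individual items are never freed, which keeps allocation down to a pointer bump in the common case. The sketch assumes the item size keeps each item suitably aligned after the node header.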

xdiff/xprepare: use a smaller sample size for histogram diff For histogram diff, we can afford a smaller sample size and thus a poorer estimate of the number of lines, as the hash table (rhash) won't be filled up/grown. This is safe as the final count of lines (xdf.nrecs) will be updated correctly anyway by xdl_prepare_ctx(). This gives us a small boost in performance. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-07-12 08:10:27 +02:00			`long xdl_guess_lines(mmfile_t *mf, long sample) {`
			`long nl = 0, size, tsize = 0;`
			`char const *data, *cur, *top;`

			`if ((cur = data = xdl_mmfile_first(mf, &size)) != NULL) {`
xdiff: do away with xdl_mmfile_next() Given our simple mmfile structure, xdl_mmfile_next() calls are redundant. Do away with calls to them. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-08-01 06:20:07 +02:00			`for (top = data + size; nl < sample && cur < top; ) {`
			`nl++;`
			`if (!(cur = memchr(cur, '\n', top - cur)))`
			`cur = top;`
			`else`
			`cur++;`
			`}`
			`tsize += (long) (cur - data);`
			`}`

			`if (nl && tsize)`
			`nl = xdl_mmfile_size(mf) / (tsize / nl);`

			`return nl + 1;`
			`}`
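The estimate computed above can be tried outside of xdiff. The following is a minimal sketch, not the library's code: `guess_lines` is a hypothetical name, and it takes a plain pointer and length instead of an `mmfile_t`. It samples at most `sample` lines from the start of the buffer and scales the average sampled line length up to the full size.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-alone version of the estimate: scan at most
 * `sample` lines from the start of the buffer, then scale the average
 * sampled line length up to the whole buffer.
 */
long guess_lines(const char *data, long size, long sample)
{
	long nl = 0, tsize = 0;
	const char *cur = data, *top;

	assert(size >= 0 && sample >= 0);

	if (data != NULL) {
		for (top = data + size; nl < sample && cur < top; ) {
			nl++;
			/* Jump past the next newline, or to the end if none is left. */
			cur = memchr(cur, '\n', top - cur);
			cur = cur ? cur + 1 : top;
		}
		tsize = (long)(cur - data);	/* bytes covered by the sample */
	}

	/* Every sampled line consumed at least one byte, so tsize / nl >= 1. */
	if (nl && tsize)
		nl = size / (tsize / nl);

	return nl + 1;
}
```

The result is only an estimate, rounded by integer division and padded by one, so a caller would presumably use it to size tables up front rather than as an exact line count.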

Teach diff about -b and -w flags This adds -b (--ignore-space-change) and -w (--ignore-all-space) flags to diff. The main part of the patch is teaching libxdiff about it. [jc: renamed xdl_line_match() to xdl_recmatch() since the former is used for different purposes in xpatchi.c which is in the parts of the upstream source we do not use.] Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-06-14 17:40:23 +02:00			`int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)`
			`{`
			`int i1, i2;`

xdiff: optimise for no whitespace difference when ignoring whitespace. In xdl_recmatch, do the memcmp to check if the two lines are equal before checking if whitespace flags are set. If the lines are identical, then there is no need to check if they differ only in whitespace. This makes the common case (there is no whitespace difference) faster. It costs the case where lines are the same length and contain whitespace differences, but the common case is more than 20% faster. Signed-off-by: Dylan Reid <dgreid@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2010-07-06 05:11:17 +02:00			`if (s1 == s2 && !memcmp(l1, l2, s1))`
			`return 1;`
xutils: Fix xdl_recmatch() on incomplete lines Thell Fowler noticed that various "ignore whitespace" options to git diff do not work well on an incomplete line. The loop control of the function responsible for these bugs was extremely difficult to follow. This patch restructures the loops for three variants of "ignore whitespace" logic. The basic idea of the re-written logic is: - A loop runs while the characters from both strings we are looking at match. We declare unmatch immediately when we find something that does not match and return false from the function. We break out of the loop if we ran out of either side of the string. The way we skip spaces inside this loop varies depending on the style of ignoring whitespaces. - After the above loop breaks, we know that the parts of the strings we inspected so far match, ignoring the whitespaces. The lines can match only if the remainder consists of nothing but whitespaces. This part of the logic is shared across all three styles. The new code is more obvious and should be much easier to follow. Tested-by: Thell Fowler <git@tbfowler.name> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-23 09:57:18 +02:00			`if (!(flags & XDF_WHITESPACE_FLAGS))`
			`return 0;`

			`i1 = 0;`
			`i2 = 0;`

			`/*`
			`* -w matches everything that matches with -b, and -b in turn`
			`* matches everything that matches with --ignore-space-at-eol.`
			`*`
			`* Each flavor of ignoring needs different logic to skip whitespaces`
			`* while we have both sides to compare.`
			`*/`
			`if (flags & XDF_IGNORE_WHITESPACE) {`
			`goto skip_ws;`
			`while (i1 < s1 && i2 < s2) {`
			`if (l1[i1++] != l2[i2++])`
diff: fix 2 whitespace issues When whitespace or whitespace change was ignored, the function xdl_recmatch() returned memcmp() style differences, which is wrong, since it should return 0 on non-match. Also, there were three horrible off-by-one bugs, even leading to wrong hashes in the whitespace special handling. The issue was noticed by Ray Lehtiniemi. For good measure, this commit adds a test. Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-10-12 14:22:14 +02:00			`return 0;`
			`skip_ws:`
xdiff: cast arguments for ctype functions to unsigned char The ctype functions isspace(), isalnum(), et al take an integer argument representing an unsigned character, or -1 for EOF. On platforms with a signed char, it is unsafe to pass a char to them without casting it to unsigned char first. Most of git is already shielded against this by the ctype implementation in git-compat-util.h, but xdiff, which uses libc ctype.h, ought to be fixed. Noticed-by: der Mouse <mouse@Rodents-Montreal.ORG> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2010-10-04 11:09:17 +02:00			`while (i1 < s1 && XDL_ISSPACE(l1[i1]))`
			`i1++;`
			`while (i2 < s2 && XDL_ISSPACE(l2[i2]))`
			`i2++;`
			`}`
			`} else if (flags & XDF_IGNORE_WHITESPACE_CHANGE) {`
			`while (i1 < s1 && i2 < s2) {`
			`if (XDL_ISSPACE(l1[i1]) && XDL_ISSPACE(l2[i2])) {`
			`/* Skip matching spaces and try again */`
			`while (i1 < s1 && XDL_ISSPACE(l1[i1]))`
teach diff machinery about --ignore-space-at-eol `git diff --ignore-space-at-eol` will ignore whitespace at the line ends. Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net> 2007-02-14 01:30:29 +01:00			`i1++;`
			`while (i2 < s2 && XDL_ISSPACE(l2[i2]))`
			`i2++;`
			`continue;`
			`}`
			`if (l1[i1++] != l2[i2++])`
			`return 0;`
			`}`
			`} else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL) {`
			`while (i1 < s1 && i2 < s2 && l1[i1++] == l2[i2++])`
			`; /* keep going */`
			`}`

			`/*`
			`* After running out of one side, the remaining side must have`
			`* nothing but whitespace for the lines to match. Note that`
			`* ignore-whitespace-at-eol case may break out of the loop`
			`* while there still are characters remaining on both lines.`
			`*/`
			`if (i1 < s1) {`
			`while (i1 < s1 && XDL_ISSPACE(l1[i1]))`
			`i1++;`
			`if (s1 != i1)`
			`return 0;`
			`}`
			`if (i2 < s2) {`
			`while (i2 < s2 && XDL_ISSPACE(l2[i2]))`
			`i2++;`
			`return (s2 == i2);`
			`}`
			`return 1;`
			`}`

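The two-phase logic of `xdl_recmatch()` (compare while both sides have characters, then require that whatever remains is whitespace) can be exercised in isolation. The sketch below covers only the `-b` (ignore-space-change) flavor and is an illustration under assumptions, not the xdiff code: `is_space`, `rest_is_space` and `match_ignore_space_change` are hypothetical names, and `is_space` stands in for `XDL_ISSPACE` by casting to `unsigned char` before calling `isspace()`.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: isspace() needs an unsigned char value (or EOF). */
static int is_space(char c)
{
	return isspace((unsigned char)c);
}

/* Hypothetical helper: true if l[i..n) holds nothing but whitespace. */
static int rest_is_space(const char *l, long i, long n)
{
	while (i < n && is_space(l[i]))
		i++;
	return i == n;
}

/* Sketch of the -b flavor: runs of whitespace match runs of whitespace. */
int match_ignore_space_change(const char *l1, long s1,
			      const char *l2, long s2)
{
	long i1 = 0, i2 = 0;

	assert(l1 && l2 && s1 >= 0 && s2 >= 0);

	while (i1 < s1 && i2 < s2) {
		if (is_space(l1[i1]) && is_space(l2[i2])) {
			/* Skip matching spaces and try again. */
			while (i1 < s1 && is_space(l1[i1]))
				i1++;
			while (i2 < s2 && is_space(l2[i2]))
				i2++;
			continue;
		}
		if (l1[i1++] != l2[i2++])
			return 0;
	}

	/* One side ran out: the other may only have whitespace left. */
	return rest_is_space(l1, i1, s1) && rest_is_space(l2, i2, s2);
}
```

With this sketch, `"a  b"` matches `"a b"` and `"a b"` matches `"a b  "`, while `"ab"` does not match `"a b"`, because whitespace on one side only is still a difference under `-b`.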
xdiff/xutils.c(xdl_hash_record): factor out whitespace handling Since in at least one use case, xdl_hash_record() takes over 15% of the CPU time, it makes sense to even micro-optimize it. For many cases, no whitespace special handling is needed, and in these cases we should not even bother to check for whitespace in _every_ iteration of the loop. Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net> 2007-03-20 04:05:10 +01:00			`static unsigned long xdl_hash_record_with_whitespace(char const **data,`
			`char const *top, long flags) {`
			`unsigned long ha = 5381;`
			`char const *ptr = *data;`

			`for (; ptr < top && *ptr != '\n'; ptr++) {`
xdiff: cast arguments for ctype functions to unsigned char The ctype functions isspace(), isalnum(), et al take an integer argument representing an unsigned character, or -1 for EOF. On platforms with a signed char, it is unsafe to pass a char to them without casting it to unsigned char first. Most of git is already shielded against this by the ctype implementation in git-compat-util.h, but xdiff, which uses libc ctype.h, ought to be fixed. Noticed-by: der Mouse <mouse@Rodents-Montreal.ORG> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2010-10-04 11:09:17 +02:00			`if (XDL_ISSPACE(*ptr)) {`
teach diff machinery about --ignore-space-at-eol `git diff --ignore-space-at-eol` will ignore whitespace at the line ends. Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net> 2007-02-14 01:30:29 +01:00			`const char *ptr2 = ptr;`
xutils: Fix hashing an incomplete line with whitespaces at the end Upon seeing a whitespace, xdl_hash_record_with_whitespace() first skipped the run of whitespaces (excluding LF) that begins there, ensuring that the pointer points at the last whitespace character in the run, and assumed that the next character must be LF at the end of the line. This does not work when hashing an incomplete line, which lacks the LF at the end. Introduce "at_eol" variable that is true when either we are at the end of line (looking at LF) or at the end of an incomplete line, and use that instead throughout the code. Noticed by Thell Fowler. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-08-23 09:51:09 +02:00			`int at_eol;`
			`while (ptr + 1 < top && XDL_ISSPACE(ptr[1])`
diff: fix 2 whitespace issues When whitespace or whitespace change was ignored, the function xdl_recmatch() returned memcmp() style differences, which is wrong, since it should return 0 on non-match. Also, there were three horrible off-by-one bugs, even leading to wrong hashes in the whitespace special handling. The issue was noticed by Ray Lehtiniemi. For good measure, this commit adds a test. Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-10-12 14:22:14 +02:00			`&& ptr[1] != '\n')`
Teach diff about -b and -w flags This adds -b (--ignore-space-change) and -w (--ignore-all-space) flags to diff. The main part of the patch is teaching libxdiff about it. [jc: renamed xdl_line_match() to xdl_recmatch() since the former is used for different purposes in xpatchi.c which is in the parts of the upstream source we do not use.] Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-06-14 17:40:23 +02:00			`ptr++;`
			`at_eol = (top <= ptr + 1 || ptr[1] == '\n');`
Fix combined use of whitespace ignore options to diff The code used to misbehave when options to ignore certain whitespaces (-w -b and --ignore-at-eol) were combined. Signed-off-by: Keith Cascio <keith@cs.ucla.edu> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-01-19 19:03:04 +01:00			`if (flags & XDF_IGNORE_WHITESPACE)`
			`; /* already handled */`
			`else if (flags & XDF_IGNORE_WHITESPACE_CHANGE`
			`&& !at_eol) {`
			`ha += (ha << 5);`
			`ha ^= (unsigned long) ' ';`
			`}`
			`else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL`
			`&& !at_eol) {`
			`while (ptr2 != ptr + 1) {`
			`ha += (ha << 5);`
			`ha ^= (unsigned long) *ptr2;`
			`ptr2++;`
			`}`
			`}`
			`continue;`
			`}`
			`ha += (ha << 5);`
			`ha ^= (unsigned long) *ptr;`
			`}`
			`*data = ptr < top ? ptr + 1: ptr;`

			`return ha;`
			`}`
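The non-whitespace path of the function above boils down to a byte-at-a-time hash: start at 5381, then for each byte do `ha += ha << 5` followed by an XOR with the byte, and finally step `*data` past the terminating newline. The following is only a minimal sketch of that core loop with all whitespace flags left out; the name `hash_line_simple` is hypothetical and does not exist in xdiff.

```c
#include <assert.h>

/*
 * Hypothetical simplified variant of the loop shown above: hashes one
 * line and ignores every whitespace flag. Not part of xdiff.
 */
static unsigned long hash_line_simple(const char **data, const char *top)
{
	unsigned long ha = 5381;
	const char *ptr = *data;

	assert(ptr <= top);
	for (; ptr < top && *ptr != '\n'; ptr++) {
		ha += (ha << 5);		/* ha = ha * 33 */
		ha ^= (unsigned long) *ptr;	/* mix in the current byte */
	}
	/* Step over the newline, if any, so the caller sees the next line. */
	*data = ptr < top ? ptr + 1 : ptr;

	return ha;
}
```

Calling it repeatedly with the same `data` pointer walks a buffer line by line; the last line may lack a trailing newline, in which case `*data` ends up equal to `top`.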

xdiff: load full words in the inner loop of xdl_hash_record Redo the hashing loop in xdl_hash_record in a way that loads an entire 'long' at a time, using masking tricks to see when and where we found the terminating '\n'. I stole inspiration and code from the posts by Linus Torvalds around https://lkml.org/lkml/2012/3/2/452 https://lkml.org/lkml/2012/3/5/6 His method reads the buffers in sizeof(long) increments, and may thus overrun it by at most sizeof(long)-1 bytes before it sees the final newline (or hits the buffer length check). I considered padding out all buffers by a suitable amount to "catch" the overrun, but * this does not work for mmap()'d buffers: if you map 4096+8 bytes from a 4096 byte file, accessing the last 8 bytes results in a SIGBUS on my machine; and * it would also be extremely ugly because it intrudes deep into the unpacking machinery. So I adapted it to not read beyond the buffer at all. Instead, it reads the final partial word byte-by-byte and strings it together. Then it can use the same logic as before to finish the hashing. So far we enable this only on x86_64, where it provides nice speedup for diff-related work: Test origin/next tr/xdiff-fast-hash ----------------------------------------------------------------------------- 4000.1: log -3000 (baseline) 0.07(0.05+0.02) 0.08(0.06+0.02) +14.3% 4000.2: log --raw -3000 (tree-only) 0.37(0.33+0.04) 0.37(0.32+0.04) +0.0% 4000.3: log -p -3000 (Myers) 1.75(1.65+0.09) 1.60(1.49+0.10) -8.6% 4000.4: log -p -3000 --histogram 1.73(1.62+0.09) 1.58(1.49+0.08) -8.7% 4000.5: log -p -3000 --patience 2.11(2.00+0.10) 1.94(1.80+0.11) -8.1% Perhaps other platforms could also benefit. However it does NOT work on big-endian systems! [jc: minimum style and compilation fixes] Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-04-06 23:01:23 +02:00			`#ifdef XDL_FAST_HASH`

xdiff: avoid compiler warnings with XDL_FAST_HASH on 32-bit machines Import macro REPEAT_BYTE from Linux (arch/x86/include/asm/word-at-a-time.h) to avoid 64-bit integer literals, which cause some 32-bit compilers to print warnings. Reported-by: Øyvind A. Holm <sunny@sunbase.org> Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-05-22 22:36:25 +02:00			`#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))`

			`#define ONEBYTES REPEAT_BYTE(0x01)`
			`#define NEWLINEBYTES REPEAT_BYTE(0x0a)`
			`#define HIGHBITS REPEAT_BYTE(0x80)`
			`/* Return the high bit set in the first byte that is a zero */`
			`static inline unsigned long has_zero(unsigned long a)`
			`{`
			`return ((a - ONEBYTES) & ~a) & HIGHBITS;`
			`}`
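To see how `has_zero()` is used to spot a newline, note that XOR-ing a word with `NEWLINEBYTES` turns every `0x0a` byte into `0x00`, so "is there a newline in this word" becomes "is there a zero byte in this word". The sketch below illustrates only that existence test. The helper `first_word_has_newline` is hypothetical, and it loads the word with `memcpy` rather than the pointer cast used in the listing, purely to keep the example alignment-safe.

```c
#include <assert.h>
#include <string.h>

#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))

/* Same expression as has_zero() in the listing above. */
static unsigned long has_zero(unsigned long a)
{
	return ((a - REPEAT_BYTE(0x01)) & ~a) & REPEAT_BYTE(0x80);
}

/*
 * Hypothetical helper: does the first word of buf contain a '\n'?
 * The caller must guarantee sizeof(unsigned long) readable bytes.
 */
static int first_word_has_newline(const char *buf)
{
	unsigned long w;

	assert(buf != NULL);
	memcpy(&w, buf, sizeof(w));	/* alignment-safe load */
	/* XOR turns every 0x0a byte into 0x00, which has_zero() detects. */
	return has_zero(w ^ REPEAT_BYTE(0x0a)) != 0;
}
```

The result is exact as a yes/no answer on any byte order. Locating which byte matched is where the byte order starts to matter, as the commit message for this code points out.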

			`static inline long count_masked_bytes(unsigned long mask)`
			`{`
xdiff: choose XDL_FAST_HASH code on sizeof(long) instead of __WORDSIZE Darwin does not define __WORDSIZE, and compiles the 32-bit code path on 64-bit systems, resulting in a totally broken git. I could not find an alternative -- other than the platform symbols (__x86_64__ etc.) -- that does the test in the preprocessor. However, we can also just test for the size of a 'long', which is what really matters here. Any compiler worth its salt will leave only the branch relevant for its platform, and indeed on Linux/GCC the numbers don't change: Test tr/darwin-xdl-fast-hash origin/next origin/master ------------------------------------------------------------------------------------------------------------------ 4000.1: log -3000 (baseline) 0.09(0.07+0.01) 0.09(0.07+0.01) -5.5%* 0.09(0.07+0.01) -4.1% 4000.2: log --raw -3000 (tree-only) 0.47(0.41+0.05) 0.47(0.40+0.05) -0.5% 0.45(0.38+0.06) -3.5%. 4000.3: log -p -3000 (Myers) 1.81(1.67+0.12) 1.81(1.67+0.13) +0.3% 1.99(1.84+0.12) +10.2%* 4000.4: log -p -3000 --histogram 1.79(1.66+0.11) 1.80(1.67+0.11) +0.4% 1.96(1.82+0.10) +9.2%* 4000.5: log -p -3000 --patience 2.17(2.02+0.13) 2.20(2.04+0.13) +1.3%. 2.33(2.18+0.13) +7.4%*** ------------------------------------------------------------------------------------------------------------------ Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com> Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-05-01 12:23:20 +02:00			`if (sizeof(long) == 8) {`
			`/*`
			`* Jan Achrenius on G+: microoptimized version of`
			`* the simpler "(mask & ONEBYTES) * ONEBYTES >> 56"`
			`* that works for the bytemasks without having to`
			`* mask them first.`
			`*/`
			`/*`
xdiff: avoid more compiler warnings with XDL_FAST_HASH on 32-bit machines Hide literals that can cause compiler warnings for 32-bit architectures in expressions that evaluate to small numbers there. Some compilers warn that 0x0001020304050608 won't fit into a 32-bit long, others that shifting right by 56 bits clears a 32-bit value completely. The correct values are calculated in the 64-bit case, which is all that matters in this if-branch. Reported-by: Øyvind A. Holm <sunny@sunbase.org> Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Acked-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-05-22 22:36:39 +02:00			`* return mask * 0x0001020304050608 >> 56;`
			`*`
			`* Doing it like this avoids warnings on 32-bit machines.`
			`*/`
			`long a = (REPEAT_BYTE(0x01) / 0xff + 1);`
			`return mask * a >> (sizeof(long) * 7);`
			`} else {`
xdiff: import new 32-bit version of count_masked_bytes() Import the latest 32-bit implementation of count_masked_bytes() from Linux (arch/x86/include/asm/word-at-a-time.h). It's shorter and avoids overflows and negative numbers. This fixes test failures on 32-bit, where negative partial results had been shifted right using the "wrong" method (logical shift right instead of arithmetic short right). The compiler is free to chose the method, so it was only wrong in the sense that it didn't work as intended by us. Reported-by: Øyvind A. Holm <sunny@sunbase.org> Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-05-22 22:36:57 +02:00			`/* Carl Chatfield / Jan Achrenius G+ version for 32-bit */`
			`/* (000000 0000ff 00ffff ffffff) -> ( 1 1 2 3 ) */`
			`long a = (0x0ff0001 + mask) >> 23;`
			`/* Fix the 1 for 00 case */`
			`return a & mask;`
			`}`
			`}`
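The role of `count_masked_bytes()` is easier to follow with a slower but more obvious stand-in. Its input is a byte mask holding `0xff` for each byte that precedes the first newline in a word, and its output is the number of those bytes. The sketch below derives that mask the same way the listing does (`(mask - 1) & ~mask`, then `>> 7`) and counts the bytes with a plain loop instead of the multiplication trick. The helpers `load_bytes`, `count_masked_bytes_simple` and `bytes_before_newline` are hypothetical names for illustration. `load_bytes` assembles the word with shifts, first byte lowest, so the example does not depend on the host byte order.

```c
#include <assert.h>
#include <stddef.h>

#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))

/* Same expression as has_zero() in the listing above. */
static unsigned long has_zero(unsigned long a)
{
	return ((a - REPEAT_BYTE(0x01)) & ~a) & REPEAT_BYTE(0x80);
}

/* Hypothetical helper: assemble up to one word of bytes, first byte lowest. */
static unsigned long load_bytes(const char *p, size_t n)
{
	unsigned long w = 0;

	assert(n <= sizeof(unsigned long));
	while (n--)
		w = (w << 8) + (unsigned char)p[n];
	return w;
}

/* Hypothetical loop-based stand-in for count_masked_bytes(). */
static long count_masked_bytes_simple(unsigned long mask)
{
	long n = 0;

	for (; mask & 0xff; mask >>= 8)
		n++;
	return n;
}

/* Bytes in the word that precede the first '\n', or -1 if there is none. */
static long bytes_before_newline(unsigned long word)
{
	unsigned long mask = has_zero(word ^ REPEAT_BYTE(0x0a));

	if (!mask)
		return -1;
	mask = (mask - 1) & ~mask;	/* all bits below the first marker bit */
	mask >>= 7;			/* 0xff for each byte before the newline */
	return count_masked_bytes_simple(mask);
}
```

The loop version trades speed for readability; the multiplication and shift variants in the listing compute the same count without a branch per byte.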

			`unsigned long xdl_hash_record(char const **data, char const *top, long flags)`
			`{`
			`unsigned long hash = 5381;`
			`unsigned long a = 0, mask = 0;`
			`char const *ptr = *data;`
			`char const *end = top - sizeof(unsigned long) + 1;`

			`if (flags & XDF_WHITESPACE_FLAGS)`
			`return xdl_hash_record_with_whitespace(data, top, flags);`

			`ptr -= sizeof(unsigned long);`
			`do {`
			`hash += hash << 5;`
			`hash ^= a;`
			`ptr += sizeof(unsigned long);`
			`if (ptr >= end)`
			`break;`
			`a = *(unsigned long *)ptr;`
			`/* Do we have any '\n' bytes in this word? */`
			`mask = has_zero(a ^ NEWLINEBYTES);`
			`} while (!mask);`

			`if (ptr >= end) {`
			`/*`
			`* There is only a partial word left at the end of the`
			`* buffer. Because we may work with a memory mapping,`
			`* we have to grab the rest byte by byte instead of`
			`* blindly reading it.`
			`*`
			`* To avoid problems with masking in a signed value,`
			`* we use an unsigned char here.`
			`*/`
			`const char *p;`
			`for (p = top - 1; p >= ptr; p--)`
			`a = (a << 8) + *((const unsigned char *)p);`
			`mask = has_zero(a ^ NEWLINEBYTES);`
			`if (!mask)`
			`/*`
			`* No '\n' found in the partial word. Make a`
			`* mask that matches what we read.`
			`*/`
			`mask = 1UL << (8 * (top - ptr) + 7);`
			`}`

			`/* The mask below the first high bit set */`
			`mask = (mask - 1) & ~mask;`
			`mask >>= 7;`
			`hash += hash << 5;`
			`hash ^= a & mask;`

			`/* Advance past the last (possibly partial) word */`
			`ptr += count_masked_bytes(mask);`

			`if (ptr < top) {`
			`assert(*ptr == '\n');`
			`ptr++;`
			`}`

			`*data = ptr;`

			`return hash;`
			`}`

			`#else /* XDL_FAST_HASH */`
xdiff/xutils.c(xdl_hash_record): factor out whitespace handling

Since in at least one use case, xdl_hash_record() takes over 15% of the CPU time, it makes sense to even micro-optimize it. For many cases, no whitespace special handling is needed, and in these cases we should not even bother to check for whitespace in _every_ iteration of the loop.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-20 04:05:10 +01:00
			`unsigned long xdl_hash_record(char const **data, char const *top, long flags) {`
			`unsigned long ha = 5381;`
			`char const *ptr = *data;`

			`if (flags & XDF_WHITESPACE_FLAGS)`
			`return xdl_hash_record_with_whitespace(data, top, flags);`

			`for (; ptr < top && *ptr != '\n'; ptr++) {`
			`ha += (ha << 5);`
			`ha ^= (unsigned long) *ptr;`
			`}`
			`*data = ptr < top ? ptr + 1: ptr;`

			`return ha;`
			`}`
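
As a rough illustration of the in/out cursor contract used by the portable loop above, here is a minimal sketch. hash_line and count_lines are hypothetical names, and the whitespace-flag branch is left out. The helper applies the same recurrence (multiply by 33, then XOR the byte) and advances the caller's cursor past the newline.

```c
#include <stddef.h>

/*
 * Hypothetical helper mirroring the portable loop: hash the bytes up to,
 * but not including, the next '\n', then move *cursor to the next line.
 */
static unsigned long hash_line(const char **cursor, const char *end)
{
	unsigned long h = 5381;
	const char *p = *cursor;

	for (; p < end && *p != '\n'; p++) {
		h += h << 5;            /* h = h * 33 */
		h ^= (unsigned long)*p; /* mix in the current byte */
	}
	*cursor = p < end ? p + 1 : p;  /* skip the newline, if there is one */
	return h;
}

/* Count the records (lines) in buf by hashing them one after another. */
static size_t count_lines(const char *buf, size_t len)
{
	const char *cur = buf, *end = buf + len;
	size_t n = 0;

	while (cur < end) {
		(void)hash_line(&cur, end);
		n++;
	}
	return n;
}
```

A final line without a trailing newline still counts as one record, because the cursor stops at the end of the buffer.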

			`#endif /* XDL_FAST_HASH */`
			`unsigned int xdl_hashbits(unsigned int size) {`
			`unsigned int val = 1, bits = 0;`

			`for (; val < size && bits < CHAR_BIT * sizeof(unsigned int); val <<= 1, bits++);`
			`return bits ? bits: 1;`
			`}`
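
To make the behavior of xdl_hashbits concrete, the sketch below restates the loop under a hypothetical name, hashbits_for. It returns the smallest bit count whose power of two is at least the requested size, with a minimum of 1. The bucket_of helper is purely illustrative and is an assumption of this sketch, not necessarily how xdiff maps a hash to a bucket.

```c
#include <limits.h>

/* Hypothetical restatement of the loop above, written with a while loop. */
static unsigned int hashbits_for(unsigned int size)
{
	unsigned int val = 1, bits = 0;

	/* Double val until it reaches size or all bits are used up. */
	while (val < size && bits < CHAR_BIT * sizeof(unsigned int)) {
		val <<= 1;
		bits++;
	}
	return bits ? bits : 1;
}

/*
 * Illustrative only: pick a bucket in a table of (1 << bits) entries by
 * masking the hash. Assumes bits is smaller than the width of long.
 */
static unsigned long bucket_of(unsigned long hash, unsigned int bits)
{
	return hash & ((1UL << bits) - 1);
}
```

For example, a request for 1000 entries gives 10 bits, so a table of 1024 buckets.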


			`int xdl_num_out(char *out, long val) {`
			`char *ptr, *str = out;`
			`char buf[32];`

			`ptr = buf + sizeof(buf) - 1;`
			`*ptr = '\0';`
			`if (val < 0) {`
			`*--ptr = '-';`
			`val = -val;`
			`}`
			`for (; val && ptr > buf; val /= 10)`
			`*--ptr = "0123456789"[val % 10];`
			`if (*ptr)`
			`for (; *ptr; ptr++, str++)`
			`*str = *ptr;`
			`else`
			`*str++ = '0';`
			`*str = '\0';`

			`return str - out;`
			`}`
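
The technique in xdl_num_out, filling a scratch buffer from its end and then copying the digits forward, can be sketched on its own as follows. num_out is a hypothetical name. Emitting the sign after the digits have been generated, so that it lands in front of them, is a choice made for this sketch.

```c
/*
 * Hypothetical stand-alone variant: write val in decimal to out, which
 * must have room for the digits, an optional sign and the terminating
 * NUL. Returns the number of characters written, excluding the NUL.
 */
static int num_out(char *out, long val)
{
	char buf[32];
	char *ptr = buf + sizeof(buf) - 1;
	char *str = out;
	/* Work on the magnitude so that LONG_MIN is handled as well. */
	unsigned long mag = val < 0 ? 0UL - (unsigned long)val
				    : (unsigned long)val;

	*ptr = '\0';
	do {
		*--ptr = (char)('0' + mag % 10); /* least significant digit first */
		mag /= 10;
	} while (mag);
	if (val < 0)
		*--ptr = '-';                    /* sign goes in front of the digits */

	while (*ptr)
		*str++ = *ptr++;
	*str = '\0';

	return (int)(str - out);
}
```

The do/while loop produces a single "0" for a zero value without a separate branch.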

xdiff: Show function names in hunk headers.

The speed of the built-in diff generator is nice; but the function names shown by `diff -p' are /really/ nice. And I hate having to choose. So, we hack xdiff to find the function names and print them.

xdiff has grown a flag to say whether to dig up the function names. The builtin_diff function passes this flag unconditionally. I suppose it could parse GIT_DIFF_OPTS, but it doesn't at the moment.

I've also reintroduced the `function name' into the test suite, from which it was removed in commit 3ce8f089.

The function names are parsed by a particularly stupid algorithm at the moment: it just tries to find a line in the `old' file, from before the start of the hunk, whose first character looks plausible. Still, it's most definitely a start.

Signed-off-by: Mark Wooding <mdw@distorted.org.uk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-28 04:23:31 +02:00
			`int xdl_emit_hunk_hdr(long s1, long c1, long s2, long c2,`
			`const char *func, long funclen, xdemitcb_t *ecb) {`
			`int nb = 0;`
			`mmbuffer_t mb;`
			`char buf[128];`

			`memcpy(buf, "@@ -", 4);`
			`nb += 4;`

Clean-up trivially redundant diff.

Also corrects the line numbers in unified output when using zero lines context.

2006-04-04 03:47:55 +02:00
			`nb += xdl_num_out(buf + nb, c1 ? s1: s1 - 1);`
built-in diff: minimum tweaks

This fixes up a couple of minor issues with the real built-in diff to be more usable:

- Omit ---/+++ header unless we emit diff output;
- Detect and punt binary diff like GNU does;
- Honor GIT_DIFF_OPTS minimally (only -u<number> and --unified=<number> are currently supported);
- Omit line count of 1 from "@@ -l,k +m,n @@" hunk header (i.e. when k == 1 or n == 1)
- Adjust testsuite for the lack of -p support.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 21:16:17 +01:00
			`if (c1 != 1) {`
			`memcpy(buf + nb, ",", 1);`
			`nb += 1;`
			`nb += xdl_num_out(buf + nb, c1);`
			`}`
			`memcpy(buf + nb, " +", 2);`
			`nb += 2;`

Clean-up trivially redundant diff. Also corrects the line numbers in unified output when using zero lines context. 2006-04-04 03:47:55 +02:00			`nb += xdl_num_out(buf + nb, c2 ? s2: s2 - 1);`
built-in diff: minimum tweaks This fixes up a couple of minor issues with the real built-in diff to be more usable: - Omit ---/+++ header unless we emit diff output; - Detect and punt binary diff like GNU does; - Honor GIT_DIFF_OPTS minimally (only -u<number> and --unified=<number> are currently supported); - Omit line count of 1 from "@@ -l,k +m,n @@" hunk header (i.e. when k == 1 or n == 1) - Adjust testsuite for the lack of -p support. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-25 21:16:17 +01:00			`if (c2 != 1) {`
			`memcpy(buf + nb, ",", 1);`
			`nb += 1;`
			`nb += xdl_num_out(buf + nb, c2);`
			`}`
xdiff: Show function names in hunk headers. The speed of the built-in diff generator is nice; but the function names shown by `diff -p' are /really/ nice. And I hate having to choose. So, we hack xdiff to find the function names and print them. xdiff has grown a flag to say whether to dig up the function names. The builtin_diff function passes this flag unconditionally. I suppose it could parse GIT_DIFF_OPTS, but it doesn't at the moment. I've also reintroduced the `function name' into the test suite, from which it was removed in commit 3ce8f089. The function names are parsed by a particularly stupid algorithm at the moment: it just tries to find a line in the `old' file, from before the start of the hunk, whose first character looks plausible. Still, it's most definitely a start. Signed-off-by: Mark Wooding <mdw@distorted.org.uk> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-28 04:23:31 +02:00			`memcpy(buf + nb, " @@", 3);`
			`nb += 3;`
			`if (func && funclen) {`
			`buf[nb++] = ' ';`
			`if (funclen > sizeof(buf) - nb - 1)`
			`funclen = sizeof(buf) - nb - 1;`
			`memcpy(buf + nb, func, funclen);`
			`nb += funclen;`
			`}`
			`buf[nb++] = '\n';`
			`mb.ptr = buf;`
			`mb.size = nb;`
			`if (ecb->outf(ecb->priv, &mb, 1) < 0)`
			`return -1;`

			`return 0;`
			`}`
xdiff/xpatience: factor out fall-back-diff function This is in preparation for the histogram diff algorithm, which will also re-use much of the code to call the default Myers diff algorithm. Signed-off-by: Tay Ray Chuan <rctay89@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-07-07 06:23:57 +02:00
			`int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,`
			`int line1, int count1, int line2, int count2)`
			`{`
			`/*`
			`* This probably does not work outside Git, since`
			`* we have a very simple mmfile structure.`
			`*`
			`* Note: ideally, we would reuse the prepared environment, but`
			`* the libxdiff interface does not (yet) allow for diffing only`
			`* ranges of lines instead of the whole files.`
			`*/`
			`mmfile_t subfile1, subfile2;`
			`xdfenv_t env;`

			`subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1]->ptr;`
			`subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2]->ptr +`
			`diff_env->xdf1.recs[line1 + count1 - 2]->size - subfile1.ptr;`
			`subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1]->ptr;`
			`subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2]->ptr +`
			`diff_env->xdf2.recs[line2 + count2 - 2]->size - subfile2.ptr;`
			`if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)`
			`return -1;`

			`memcpy(diff_env->xdf1.rchg + line1 - 1, env.xdf1.rchg, count1);`
			`memcpy(diff_env->xdf2.rchg + line2 - 1, env.xdf2.rchg, count2);`

			`xdl_free_env(&env);`

			`return 0;`
			`}`
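
The pointer arithmetic in `xdl_fall_back_diff` carves a contiguous sub-buffer out of the original file: it starts at the first record of the range and ends at the last byte of the last record. Here is a minimal sketch of that idea, assuming every record points into one buffer, records are stored in file order, and each record's size includes its trailing newline. `struct line_rec`, `struct span`, `index_lines` and `line_range` are hypothetical stand-ins, not xdiff types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for xdiff's per-line record and mmfile_t. */
struct line_rec {
	const char *ptr;	/* first byte of the line */
	size_t size;		/* length, including the newline if present */
};

struct span {
	const char *ptr;
	size_t size;
};

/*
 * Split buf into line records. Returns the number of records filled
 * in, which is at most max.
 */
static size_t index_lines(const char *buf, size_t len,
			  struct line_rec *recs, size_t max)
{
	size_t n = 0, pos = 0;

	while (pos < len && n < max) {
		const char *nl = memchr(buf + pos, '\n', len - pos);
		size_t end = nl ? (size_t)(nl - buf) + 1 : len;

		recs[n].ptr = buf + pos;
		recs[n].size = end - pos;
		n++;
		pos = end;
	}
	return n;
}

/*
 * Return the bytes covering `count` lines starting at the 1-based
 * line number `line1`. The caller must ensure count >= 1 and that the
 * range lies within recs[].
 */
static struct span line_range(const struct line_rec *recs,
			      long line1, long count)
{
	const struct line_rec *first, *last;
	struct span s;

	assert(recs != NULL && line1 >= 1 && count >= 1);

	first = &recs[line1 - 1];		/* 1-based to 0-based */
	last = &recs[line1 + count - 2];	/* last line in the range */

	s.ptr = first->ptr;
	s.size = (size_t)(last->ptr + last->size - first->ptr);
	return s;
}
```

The `line1 + count - 2` index mirrors the original: converting a 1-based start to 0-based subtracts one, and stepping to the last line of a `count`-line range subtracts one more.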