mirror of https://github.com/git/git.git synced 2024-11-16 06:03:44 +01:00

234 lines

5.4 KiB

C

diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it does not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00			`#include "cache.h"`
			`#include "diff.h"`
			`#include "diffcore.h"`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00
			`/*`
			`* Idea here is very simple.`
			`*`
diffcore-delta.c: update the comment on the algorithm. The comment at the top of the file described an old algorithm that was neutral to text/binary differences (it hashed sliding window of N-byte sequences and counted overlaps), but long time ago we switched to a new heuristics that are more suitable for line oriented (read: text) files that are much faster. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 08:11:40 +02:00			`* Almost all data we are interested in are text, but sometimes we have`
			`* to deal with binary data. So we cut them into chunks delimited by`
			`* LF byte, or 64-byte sequence, whichever comes first, and hash them.`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`*`
diffcore-delta.c: update the comment on the algorithm. The comment at the top of the file described an old algorithm that was neutral to text/binary differences (it hashed sliding window of N-byte sequences and counted overlaps), but long time ago we switched to a new heuristics that are more suitable for line oriented (read: text) files that are much faster. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 08:11:40 +02:00			`* For those chunks, if the source buffer has more instances of it`
			`* than the destination buffer, that means the difference are the`
			`* number of bytes not copied from source to destination. If the`
			`* counts are the same, everything was copied from source to`
			`* destination. If the destination has more, everything was copied,`
			`* and destination added more.`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`*`
			`* We are doing an approximation so we do not really have to waste`
			`* memory by actually storing the sequence. We just hash them into`
			`* somewhere around 2^16 hashbuckets and count the occurrences.`
			`*/`
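The counting idea in the comment above can be sketched in isolation. This is a minimal illustration, not the file's implementation: it assumes a fixed table of `SKETCH_BUCKETS` slots instead of a growing one, substitutes FNV-1a for the file's own hash, and also counts a trailing chunk that has no LF. All `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed table size; the real file grows its table instead. */
#define SKETCH_BUCKETS 4096u

/*
 * Cut buf into chunks that end at LF or after 64 bytes, whichever
 * comes first, and add each chunk's length to the bucket selected by
 * its hash. FNV-1a is an assumption standing in for the file's hash.
 */
static void sketch_count_chunks(const unsigned char *buf, size_t len,
				unsigned long *counts)
{
	unsigned int h = 2166136261u;
	size_t n = 0, i;

	for (i = 0; i < len; i++) {
		h = (h ^ buf[i]) * 16777619u;
		n++;
		if (n < 64 && buf[i] != '\n' && i + 1 < len)
			continue;
		counts[h % SKETCH_BUCKETS] += n;
		h = 2166136261u;
		n = 0;
	}
}

/* Estimate bytes copied from src to dst and bytes newly added in dst. */
static void sketch_count_changes(const char *src, const char *dst,
				 unsigned long *copied, unsigned long *added)
{
	unsigned long s[SKETCH_BUCKETS], d[SKETCH_BUCKETS];
	size_t i;

	assert(src && dst && copied && added);
	memset(s, 0, sizeof(s));
	memset(d, 0, sizeof(d));
	sketch_count_chunks((const unsigned char *)src, strlen(src), s);
	sketch_count_chunks((const unsigned char *)dst, strlen(dst), d);

	*copied = *added = 0;
	for (i = 0; i < SKETCH_BUCKETS; i++) {
		if (s[i] < d[i]) {
			/* destination has more: all of source copied, rest added */
			*copied += s[i];
			*added += d[i] - s[i];
		} else {
			/* source has as many or more: only d[i] bytes survived */
			*copied += d[i];
		}
	}
}
```

For the inputs `"a\nb\nc\n"` and `"a\nc\nd\n"`, two of the three 2-byte lines survive and one new line appears, so the sketch reports 4 bytes copied and 2 bytes added.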

diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`/* Wild guess at the initial hash size */`
diffcore-delta: make the hash a bit denser. To reduce wasted memory, wait until the hash fills up more densely before we rehash. This reduces the working set size a bit further. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 01:39:51 +01:00			`#define INITIAL_HASH_SIZE 9`
diffcore-delta: tweak hashbase value. This tweaks the maximum hashvalue we use to hash the string into without making the maximum size of the hashtable can grow from the current limit. With this, the renames detected becomes a bit more precise without incurring additional paging cost. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 05:32:06 +01:00
diffcore-delta: make the hash a bit denser. To reduce wasted memory, wait until the hash fills up more densely before we rehash. This reduces the working set size a bit further. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 01:39:51 +01:00			`/* We leave more room in smaller hash but do not let it`
			`* grow to have unused hole too much.`
			`*/`
			`#define INITIAL_FREE(sz_log2) ((1<<(sz_log2))*(sz_log2-3)/(sz_log2))`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00
diffcore-delta: tweak hashbase value. This tweaks the maximum hashvalue we use to hash the string into without making the maximum size of the hashtable can grow from the current limit. With this, the renames detected becomes a bit more precise without incurring additional paging cost. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 05:32:06 +01:00			`/* A prime rather carefully chosen between 2^16..2^17, so that`
			`* HASHBASE < INITIAL_FREE(17). We want to keep the maximum hashtable`
			`* size under the current 2<<17 maximum, which can hold this many`
			`* different values before overflowing to hashtable of size 2<<18.`
			`*/`
			`#define HASHBASE 107927`
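The relationship between `HASHBASE` and `INITIAL_FREE` stated in the comment can be checked with a few lines of arithmetic. The sketch below copies the two macros verbatim; the wrapper `sketch_initial_free` is a hypothetical helper added only to guard the integer range.

```c
#include <assert.h>

#define INITIAL_FREE(sz_log2) ((1<<(sz_log2))*(sz_log2-3)/(sz_log2))
#define HASHBASE 107927

/* Entries a table of 2^sz_log2 slots accepts before it is rehashed. */
static int sketch_initial_free(int sz_log2)
{
	/* below 27 the product (1 << n) * (n - 3) still fits a 32-bit int */
	assert(sz_log2 > 3 && sz_log2 < 27);
	return INITIAL_FREE(sz_log2);
}
```

With integer division, `INITIAL_FREE(9)` is 512 * 6 / 9 = 341 and `INITIAL_FREE(17)` is 131072 * 14 / 17 = 107941, so `HASHBASE` (107927) sits just under the capacity of a table with `alloc_log2` of 17 and above the 53248 entries that `alloc_log2` of 16 allows.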

diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`struct spanhash {`
diffcore-delta: 64-byte-or-EOL ultrafast replacement (hash fix). The rotating 64-bit number was not really rotating, and worse yet ulong was longer than 64-bit on 64-bit architectures X-<. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`unsigned int hashval;`
			`unsigned int cnt;`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`};`
			`struct spanhash_top {`
			`int alloc_log2;`
			`int free;`
			`struct spanhash data[FLEX_ARRAY];`
			`};`

			`static struct spanhash_top *spanhash_rehash(struct spanhash_top *orig)`
			`{`
			`struct spanhash_top *new;`
			`int i;`
			`int osz = 1 << orig->alloc_log2;`
			`int sz = osz << 1;`

			`new = xmalloc(sizeof(*orig) + sizeof(struct spanhash) * sz);`
			`new->alloc_log2 = orig->alloc_log2 + 1;`
diffcore-delta: make the hash a bit denser. To reduce wasted memory, wait until the hash fills up more densely before we rehash. This reduces the working set size a bit further. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 01:39:51 +01:00			`new->free = INITIAL_FREE(new->alloc_log2);`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`memset(new->data, 0, sizeof(struct spanhash) * sz);`
			`for (i = 0; i < osz; i++) {`
			`struct spanhash *o = &(orig->data[i]);`
			`int bucket;`
			`if (!o->cnt)`
			`continue;`
			`bucket = o->hashval & (sz - 1);`
			`while (1) {`
			`struct spanhash *h = &(new->data[bucket++]);`
			`if (!h->cnt) {`
			`h->hashval = o->hashval;`
			`h->cnt = o->cnt;`
			`new->free--;`
			`break;`
			`}`
			`if (sz <= bucket)`
			`bucket = 0;`
			`}`
			`}`
			`free(orig);`
			`return new;`
			`}`

			`static struct spanhash_top *add_spanhash(struct spanhash_top *top,`
diffcore-delta: 64-byte-or-EOL ultrafast replacement (hash fix). The rotating 64-bit number was not really rotating, and worse yet ulong was longer than 64-bit on 64-bit architectures X-<. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`unsigned int hashval, int cnt)`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`{`
			`int bucket, lim;`
			`struct spanhash *h;`

			`lim = (1 << top->alloc_log2);`
			`bucket = hashval & (lim - 1);`
			`while (1) {`
			`h = &(top->data[bucket++]);`
			`if (!h->cnt) {`
			`h->hashval = hashval;`
diffcore-delta: 64-byte-or-EOL ultrafast replacement. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`h->cnt = cnt;`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`top->free--;`
			`if (top->free < 0)`
			`return spanhash_rehash(top);`
			`return top;`
			`}`
			`if (h->hashval == hashval) {`
diffcore-delta: 64-byte-or-EOL ultrafast replacement. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`h->cnt += cnt;`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`return top;`
			`}`
			`if (lim <= bucket)`
			`bucket = 0;`
			`}`
			`}`

optimize diffcore-delta by sorting hash entries. Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.. Before: [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m10.526s user 0m10.401s sys 0m0.136s After: [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m8.876s user 0m8.761s sys 0m0.128s but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be correct too, that 15% may well be worth it. [ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ] Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might in order. Linus 2007-10-03 04:28:19 +02:00			`static int spanhash_cmp(const void *a_, const void *b_)`
			`{`
			`const struct spanhash *a = a_;`
			`const struct spanhash *b = b_;`

			`/* A count of zero compares at the end.. */`
			`if (!a->cnt)`
			`return !b->cnt ? 0 : 1;`
			`if (!b->cnt)`
			`return -1;`
			`return a->hashval < b->hashval ? -1 :`
			`a->hashval > b->hashval ? 1 : 0;`
			`}`
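The effect of this comparator can be seen on a small array. The sketch below is an illustration under assumptions: `struct sketch_span` mirrors the two fields of `struct spanhash`, and `sketch_sort_spans` is a hypothetical helper that sorts the slots and returns how many are in use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in with the same two fields as struct spanhash. */
struct sketch_span {
	unsigned int hashval;
	unsigned int cnt;
};

/* Same ordering as spanhash_cmp: empty slots last, others by hashval. */
static int sketch_span_cmp(const void *a_, const void *b_)
{
	const struct sketch_span *a = a_;
	const struct sketch_span *b = b_;

	if (!a->cnt)
		return !b->cnt ? 0 : 1;
	if (!b->cnt)
		return -1;
	return a->hashval < b->hashval ? -1 :
		a->hashval > b->hashval ? 1 : 0;
}

/* Sort n slots and return the number of used (cnt != 0) entries. */
static size_t sketch_sort_spans(struct sketch_span *v, size_t n)
{
	size_t used = 0;

	assert(v != NULL);
	qsort(v, n, sizeof(v[0]), sketch_span_cmp);
	/* used entries now form a prefix, terminated by the first empty slot */
	while (used < n && v[used].cnt)
		used++;
	return used;
}
```

Because a table that is never completely full always sorts at least one empty slot to the end, a reader can walk the array until `cnt` is zero without tracking an index bound, which is the property the commit message relies on.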

diffcore-delta.c: Ignore CR in CRLF for text files This ignores CR byte in CRLF sequence in text file when computing similarity of two blobs. Usually this should not matter as nobody sane would be checking in a file with CRLF line endings to the repository (they would use autocrlf so that the repository copy would have LF line endings). Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 08:14:13 +02:00			`static struct spanhash_top *hash_chars(struct diff_filespec *one)`
diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it does not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00			`{`
diffcore-delta: 64-byte-or-EOL ultrafast replacement. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`int i, n;`
diffcore-delta: 64-byte-or-EOL ultrafast replacement (hash fix). The rotating 64-bit number was not really rotating, and worse yet ulong was longer than 64-bit on 64-bit architectures X-<. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`unsigned int accum1, accum2, hashval;`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`struct spanhash_top *hash;`
diffcore-delta.c: Ignore CR in CRLF for text files This ignores CR byte in CRLF sequence in text file when computing similarity of two blobs. Usually this should not matter as nobody sane would be checking in a file with CRLF line endings to the repository (they would use autocrlf so that the repository copy would have LF line endings). Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 08:14:13 +02:00			`unsigned char *buf = one->data;`
			`unsigned int sz = one->size;`
Introduce diff_filespec_is_binary() This replaces an explicit initialization of filespec->is_binary field used for rename/break followed by direct access to that field with a wrapper function that lazily iniaitlizes and accesses the field. We would add more attribute accesses for the use of diff routines, and it would be better to make this abstraction earlier. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-07-06 09:18:54 +02:00			`int is_text = !diff_filespec_is_binary(one);`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00
			`i = INITIAL_HASH_SIZE;`
			`hash = xmalloc(sizeof(*hash) + sizeof(struct spanhash) * (1<<i));`
			`hash->alloc_log2 = i;`
diffcore-delta: make the hash a bit denser. To reduce wasted memory, wait until the hash fills up more densely before we rehash. This reduces the working set size a bit further. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 01:39:51 +01:00			`hash->free = INITIAL_FREE(i);`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`memset(hash->data, 0, sizeof(struct spanhash) * (1<<i));`
diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it does not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00
diffcore-delta: 64-byte-or-EOL ultrafast replacement. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`n = 0;`
			`accum1 = accum2 = 0;`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`while (sz) {`
diffcore-delta: 64-byte-or-EOL ultrafast replacement (hash fix). The rotating 64-bit number was not really rotating, and worse yet ulong was longer than 64-bit on 64-bit architectures X-<. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`unsigned int c = *buf++;`
			`unsigned int old_1 = accum1;`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`sz--;`
diffcore-delta.c: Ignore CR in CRLF for text files This ignores CR byte in CRLF sequence in text file when computing similarity of two blobs. Usually this should not matter as nobody sane would be checking in a file with CRLF line endings to the repository (they would use autocrlf so that the repository copy would have LF line endings). Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 08:14:13 +02:00
			`/* Ignore CR in CRLF sequence if text */`
			`if (is_text && c == '\r' && sz && *buf == '\n')`
			`continue;`

diffcore-delta: 64-byte-or-EOL ultrafast replacement (hash fix). The rotating 64-bit number was not really rotating, and worse yet ulong was longer than 64-bit on 64-bit architectures X-<. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`accum1 = (accum1 << 7) ^ (accum2 >> 25);`
			`accum2 = (accum2 << 7) ^ (old_1 >> 25);`
diffcore-delta: 64-byte-or-EOL ultrafast replacement. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-15 09:37:57 +01:00			`accum1 += c;`
			`if (++n < 64 && c != '\n')`
			`continue;`
			`hashval = (accum1 + accum2 * 0x61) % HASHBASE;`
			`hash = add_spanhash(hash, hashval, n);`
			`n = 0;`
			`accum1 = accum2 = 0;`
diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it does not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00			`}`
optimize diffcore-delta by sorting hash entries. Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.. Before: [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m10.526s user 0m10.401s sys 0m0.136s After: [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m8.876s user 0m8.761s sys 0m0.128s but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be correct too, that 15% may well be worth it. [ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ] Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might in order. Linus 2007-10-03 04:28:19 +02:00			`qsort(hash->data,`
			`1ul << hash->alloc_log2,`
			`sizeof(hash->data[0]),`
			`spanhash_cmp);`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`return hash;`
diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it does not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00			`}`
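The chunking loop of `hash_chars` can be exercised on its own to see the CRLF handling. The sketch below reuses the accumulator steps and the `HASHBASE` value from the listing, but writes the per-chunk hash values into a caller-supplied array instead of a span hash table; `sketch_chunk_hashes` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hash each LF-or-64-byte chunk of buf and store up to max hash values
 * in out. Returns the number of values stored. As in the listing, a
 * trailing chunk without LF that is shorter than 64 bytes is not emitted.
 */
static size_t sketch_chunk_hashes(const unsigned char *buf, size_t sz,
				  int is_text, unsigned int *out, size_t max)
{
	unsigned int accum1 = 0, accum2 = 0;
	size_t n = 0, used = 0;

	assert(buf && out);
	while (sz) {
		unsigned int c = *buf++;
		unsigned int old_1 = accum1;
		sz--;

		/* Ignore CR in CRLF sequence if text */
		if (is_text && c == '\r' && sz && *buf == '\n')
			continue;

		accum1 = (accum1 << 7) ^ (accum2 >> 25);
		accum2 = (accum2 << 7) ^ (old_1 >> 25);
		accum1 += c;
		if (++n < 64 && c != '\n')
			continue;
		if (used < max)
			out[used++] = (accum1 + accum2 * 0x61) % 107927;
		n = 0;
		accum1 = accum2 = 0;
	}
	return used;
}
```

Hashing `"a\nb\n"` and `"a\r\nb\r\n"` with `is_text` set yields the same two values, while hashing the CRLF input with `is_text` cleared yields different ones, since the CR bytes then feed into the accumulators.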

diffcore_count_changes: pass diffcore_filespec We may want to use richer information on the data we are dealing with in this function, so instead of passing a buffer address and length, just pass the diffcore_filespec structure. Existing callers always call this function with parameters taken from a filespec anyway, so there is no functionality changes. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 07:54:37 +02:00			`int diffcore_count_changes(struct diff_filespec *src,`
			`struct diff_filespec *dst,`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`void **src_count_p,`
			`void **dst_count_p,`
diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it does not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00			`unsigned long delta_limit,`
			`unsigned long *src_copied,`
			`unsigned long *literal_added)`
			`{`
optimize diffcore-delta by sorting hash entries. Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.. Before: [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m10.526s user 0m10.401s sys 0m0.136s After: [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m8.876s user 0m8.761s sys 0m0.128s but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be correct too, that 15% may well be worth it. [ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ] Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might in order. Linus 2007-10-03 04:28:19 +02:00			`struct spanhash *s, *d;`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`struct spanhash_top *src_count, *dst_count;`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`unsigned long sc, la;`

diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`src_count = dst_count = NULL;`
			`if (src_count_p)`
			`src_count = *src_count_p;`
			`if (!src_count) {`
diffcore-delta.c: Ignore CR in CRLF for text files This ignores CR byte in CRLF sequence in text file when computing similarity of two blobs. Usually this should not matter as nobody sane would be checking in a file with CRLF line endings to the repository (they would use autocrlf so that the repository copy would have LF line endings). Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 08:14:13 +02:00			`src_count = hash_chars(src);`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`if (src_count_p)`
			`*src_count_p = src_count;`
			`}`
			`if (dst_count_p)`
			`dst_count = *dst_count_p;`
			`if (!dst_count) {`
diffcore-delta.c: Ignore CR in CRLF for text files This ignores CR byte in CRLF sequence in text file when computing similarity of two blobs. Usually this should not matter as nobody sane would be checking in a file with CRLF line endings to the repository (they would use autocrlf so that the repository copy would have LF line endings). Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 08:14:13 +02:00			`dst_count = hash_chars(dst);`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`if (dst_count_p)`
			`*dst_count_p = dst_count;`
			`}`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`sc = la = 0;`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00
optimize diffcore-delta by sorting hash entries. Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.. Before: [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m10.526s user 0m10.401s sys 0m0.136s After: [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m8.876s user 0m8.761s sys 0m0.128s but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be correct too, that 15% may well be worth it. [ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ] Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might in order. Linus 2007-10-03 04:28:19 +02:00			`s = src_count->data;`
			`d = dst_count->data;`
			`for (;;) {`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`unsigned dst_cnt, src_cnt;`
			`if (!s->cnt)`
optimize diffcore-delta by sorting hash entries. Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.. Before: [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m10.526s user 0m10.401s sys 0m0.136s After: [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m8.876s user 0m8.761s sys 0m0.128s but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be correct too, that 15% may well be worth it. [ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ] Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might in order. Linus 2007-10-03 04:28:19 +02:00			`break; /* we checked all in src */`
			`while (d->cnt) {`
			`if (d->hashval >= s->hashval)`
			`break;`
Fix diff -B/--dirstat miscounting of newly added contents What used to happen is that diffcore_count_changes() simply ignored any hashes in the destination that didn't match hashes in the source. EXCEPT if the source hash didn't exist at all, in which case it would count _one_ destination hash that happened to have the "next" hash value. As a consequence, newly added material was often undercounted, making output from --dirstat and "complete rewrite" detection used by -B unrelialble. This changes it so that: - whenever it bypasses a destination hash (because it doesn't match a source), it counts the bytes associated with that as "literal added" - at the end (once we have used up all the source hashes), we do the same thing with the remaining destination hashes. - when hashes do match, and we use the difference in counts as a value, we also use up that destination hash entry (the 'd++'). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-12-04 21:07:47 +01:00			`la += d->cnt;`
optimize diffcore-delta by sorting hash entries. Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.. Before: [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m10.526s user 0m10.401s sys 0m0.136s After: [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. \| wc 7104 28574 438020 real 0m8.876s user 0m8.761s sys 0m0.128s but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be correct too, that 15% may well be worth it. [ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ] Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might in order. Linus 2007-10-03 04:28:19 +02:00			`d++;`
			`}`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`src_cnt = s->cnt;`
Fix diff -B/--dirstat miscounting of newly added contents What used to happen is that diffcore_count_changes() simply ignored any hashes in the destination that didn't match hashes in the source. EXCEPT if the source hash didn't exist at all, in which case it would count _one_ destination hash that happened to have the "next" hash value. As a consequence, newly added material was often undercounted, making output from --dirstat and "complete rewrite" detection used by -B unreliable. This changes it so that: - whenever it bypasses a destination hash (because it doesn't match a source), it counts the bytes associated with that as "literal added" - at the end (once we have used up all the source hashes), we do the same thing with the remaining destination hashes. - when hashes do match, and we use the difference in counts as a value, we also use up that destination hash entry (the 'd++'). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-12-04 21:07:47 +01:00			`dst_cnt = 0;`
			`if (d->cnt && d->hashval == s->hashval) {`
			`dst_cnt = d->cnt;`
			`d++;`
			`}`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`if (src_cnt < dst_cnt) {`
			`la += dst_cnt - src_cnt;`
			`sc += src_cnt;`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`}`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`else`
			`sc += dst_cnt;`
optimize diffcore-delta by sorting hash entries. Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.. Before: [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. | wc 7104 28574 438020 real 0m10.526s user 0m10.401s sys 0m0.136s After: [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. | wc 7104 28574 438020 real 0m8.876s user 0m8.761s sys 0m0.128s but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be correct too, that 15% may well be worth it. [ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ] Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might be in order. Linus 2007-10-03 04:28:19 +02:00			`s++;`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`}`
Fix diff -B/--dirstat miscounting of newly added contents What used to happen is that diffcore_count_changes() simply ignored any hashes in the destination that didn't match hashes in the source. EXCEPT if the source hash didn't exist at all, in which case it would count _one_ destination hash that happened to have the "next" hash value. As a consequence, newly added material was often undercounted, making output from --dirstat and "complete rewrite" detection used by -B unreliable. This changes it so that: - whenever it bypasses a destination hash (because it doesn't match a source), it counts the bytes associated with that as "literal added" - at the end (once we have used up all the source hashes), we do the same thing with the remaining destination hashes. - when hashes do match, and we use the difference in counts as a value, we also use up that destination hash entry (the 'd++'). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-12-04 21:07:47 +01:00			`while (d->cnt) {`
			`la += d->cnt;`
			`d++;`
			`}`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00
			`if (!src_count_p)`
			`free(src_count);`
			`if (!dst_count_p)`
			`free(dst_count);`
diffcore-delta: make change counter to byte oriented again. The textual line oriented change counter was fun but was not very effective. It tended to overcount the changes. This one changes it to a simple N-letter substring based implementation. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-04 12:21:55 +01:00			`*src_copied = sc;`
			`*literal_added = la;`
			`return 0;`
diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it does not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00			`}`
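
The counting loop that ends here walks two hash tables in step, relying on both being sorted by hash value and ending in an entry with a zero count. The following is a minimal standalone sketch of that merge, not the code from this file: `struct span_entry` and `count_changes_sketch` are hypothetical names introduced only for illustration, and the sketch assumes the caller has already built and sorted both arrays.

```c
/* Hypothetical stand-in for the hash-table entries walked by the loop above. */
struct span_entry {
	unsigned int hashval;	/* hash of one chunk */
	unsigned int cnt;	/* bytes hashing to hashval; 0 marks the end */
};

/*
 * Assumes both arrays are sorted by ascending hashval and end with an
 * entry whose cnt is 0, so no separate length is needed.
 */
void count_changes_sketch(const struct span_entry *s,
			  const struct span_entry *d,
			  unsigned long *src_copied,
			  unsigned long *literal_added)
{
	unsigned long sc = 0, la = 0;

	for (; s->cnt; s++) {
		unsigned int dst_cnt = 0;

		/* Destination hashes that sort before s have no source: added. */
		while (d->cnt && d->hashval < s->hashval) {
			la += d->cnt;
			d++;
		}
		/* On a match, use up the destination entry as well. */
		if (d->cnt && d->hashval == s->hashval) {
			dst_cnt = d->cnt;
			d++;
		}
		if (s->cnt < dst_cnt) {
			/* Destination has more of this chunk than the source. */
			la += dst_cnt - s->cnt;
			sc += s->cnt;
		} else {
			sc += dst_cnt;
		}
	}
	/* Source exhausted: whatever remains in the destination is added. */
	for (; d->cnt; d++)
		la += d->cnt;

	*src_copied = sc;
	*literal_added = la;
}
```

For example, with source entries (1, 10), (3, 5), (7, 4) and destination entries (2, 6), (3, 8), (9, 2), each array followed by a zero-count terminator, the sketch reports 5 bytes copied and 11 bytes added: only hash 3 is shared, and the destination-only hashes 2 and 9 are counted in full as additions.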