mirrors/git

mirror of https://github.com/git/git.git synced 2024-11-14 13:13:01 +01:00

726 lines

19 KiB

C


[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`/*`
			`* Copyright (C) 2005 Junio C Hamano`
			`*/`
			`#include "cache.h"`
			`#include "diff.h"`
			`#include "diffcore.h"`
Do linear-time/space rename logic for exact renames This implements a smarter rename detector for exact renames, which rather than doing a pairwise comparison (time O(m*n)) will just hash the files into a hash-table (size O(n+m)), and only do pairwise comparisons to renames that have the same hash (time O(n+m) except for unrealistic hash collissions, which we just cull aggressively). Admittedly the exact rename case is not nearly as interesting as the generic case, but it's an important case none-the-less. A similar general approach should work for the generic case too, but even then you do need to handle the exact renames/copies separately (to avoid the inevitable added cost factor that comes from the _size_ of the file), so this is worth doing. In the expectation that we will indeed do the same hashing trick for the general rename case, this code uses a generic hash-table implementation that can be used for other things too. In fact, we might be able to consolidate some of our existing hash tables with the new generic code in hash.[ch]. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:23:26 +02:00			`#include "hash.h"`
add inexact rename detection progress infrastructure We might spend many seconds doing inexact rename detection with no output. It's nice to let the user know that something is actually happening. This patch adds the infrastructure, but no callers actually turn on progress reporting. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-20 10:51:16 +01:00			`#include "progress.h"`
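The exact-rename commit message above describes hashing files into a table so that only entries sharing a hash are compared pairwise. The following is a minimal sketch of that idea, not git's implementation. It rests on several assumptions: `struct entry`, `hash_id()` and `match_exact()` are hypothetical names, the table is a small fixed-size chained table rather than the generic one in hash.[ch], and a string `id` stands in for the object hash of the file content.

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 64	/* hypothetical fixed table size */

struct entry {
	const char *path;
	const char *id;		/* stand-in for a content hash */
	struct entry *next;	/* chain of entries in the same bucket */
};

/* Simple string hash (djb2 style), reduced to a bucket index. */
static unsigned hash_id(const char *id)
{
	unsigned h = 5381;
	while (*id)
		h = h * 33 + (unsigned char)*id++;
	return h % NBUCKETS;
}

/*
 * For each destination, look for a source with an identical id.
 * matches[i] receives the index into srcs, or -1 if none matches.
 * Returns the number of destinations that found a source.
 */
static int match_exact(struct entry *srcs, int nsrc,
		       struct entry *dsts, int ndst, int *matches)
{
	struct entry *buckets[NBUCKETS] = { 0 };
	int i, found = 0;

	assert(srcs && dsts && matches);

	/* Pass 1: hash every source into its bucket. */
	for (i = 0; i < nsrc; i++) {
		unsigned h = hash_id(srcs[i].id);
		srcs[i].next = buckets[h];
		buckets[h] = &srcs[i];
	}

	/* Pass 2: each destination only inspects its own bucket. */
	for (i = 0; i < ndst; i++) {
		struct entry *e;
		matches[i] = -1;
		for (e = buckets[hash_id(dsts[i].id)]; e; e = e->next) {
			if (!strcmp(e->id, dsts[i].id)) {
				matches[i] = (int)(e - srcs);
				found++;
				break;
			}
		}
	}
	return found;
}
```

Several destinations may match the same source in this sketch, which corresponds to treating the extra matches as copies. Deciding which match becomes the rename is left out.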
[PATCH] Redo rename/copy detection logic. Earlier implementation had a major screw-up in the memory management area. Rename/copy logic sometimes borrowed a pointer to a structure without any provision for downstream to determine which pointer is shared and which is not. This resulted in the later clean-up code to sometimes double free such structure, resulting in a segfault. This made -M and -C useless. Another problem the earlier implementation had was that it reordered the patches, and forced the logic to differentiate renames and copies to depend on that particular order. This problem was fixed by teaching rename/copy detection logic not to do any reordering, and rename-copy differentiator not to depend on the order of the patches. The diffs will leave rename/copy detector in the same destination path order as the patch that was fed into it. Some test vectors have been reordered to accommodate this change. It also adds a sanity check logic to the human-readable diff-raw output to detect paths with embedded TAB and LF characters, which cannot be expressed with that format. This idea came up during a discussion with Chris Wedgwood. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 10:10:48 +02:00			`/* Table of rename/copy destinations */`

			`static struct diff_rename_dst {`
			`struct diff_filespec *two;`
			`struct diff_filepair *pair;`
			`} *rename_dst;`
			`static int rename_dst_nr, rename_dst_alloc;`
			`static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,`
			`int insert_ok)`
			`{`
			`int first, last;`

			`first = 0;`
			`last = rename_dst_nr;`
			`while (last > first) {`
			`int next = (last + first) >> 1;`
			`struct diff_rename_dst *dst = &(rename_dst[next]);`
			`int cmp = strcmp(two->path, dst->two->path);`
			`if (!cmp)`
			`return dst;`
			`if (cmp < 0) {`
			`last = next;`
			`continue;`
			`}`
			`first = next+1;`
			`}`
			`/* not found */`
			`if (!insert_ok)`
			`return NULL;`
			`/* insert to make it at "first" */`
			`if (rename_dst_alloc <= rename_dst_nr) {`
			`rename_dst_alloc = alloc_nr(rename_dst_alloc);`
			`rename_dst = xrealloc(rename_dst,`
			`rename_dst_alloc * sizeof(*rename_dst));`
			`}`
			`rename_dst_nr++;`
			`if (first < rename_dst_nr)`
			`memmove(rename_dst + first + 1, rename_dst + first,`
			`(rename_dst_nr - first - 1) * sizeof(*rename_dst));`
Plug diff leaks. It is a bit embarrassing that it took this long for a fix since the problem was first reported on Aug 13th. Message-ID: <87y876gl1r.wl@mail2.atmark-techno.com> From: Yasushi SHOJI <yashi@atmark-techno.com> Newsgroups: gmane.comp.version-control.git Subject: [patch] possible memory leak in diff.c::diff_free_filepair() Date: Sat, 13 Aug 2005 19:58:56 +0900 This time I used valgrind to make sure that it does not overeagerly discard memory that is still being used. Signed-off-by: Junio C Hamano <junkio@cox.net> 2005-09-16 01:13:43 +02:00			`rename_dst[first].two = alloc_filespec(two->path);`
diff: do not use null sha1 as a sentinel value The diff code represents paths using the diff_filespec struct. This struct has a sha1 to represent the sha1 of the content at that path, as well as a sha1_valid member which indicates whether its sha1 field is actually useful. If sha1_valid is not true, then the filespec represents a working tree file (e.g., for the no-index case, or for when the index is not up-to-date). The diff_filespec is only used internally, though. At the interfaces to the diff subsystem, callers feed the sha1 directly, and we create a diff_filespec from it. It's at that point that we look at the sha1 and decide whether it is valid or not; callers may pass the null sha1 as a sentinel value to indicate that it is not. We should not typically see the null sha1 coming from any other source (e.g., in the index itself, or from a tree). However, a corrupt tree might have a null sha1, which would cause "diff --patch" to accidentally diff the working tree version of a file instead of treating it as a blob. This patch extends the edges of the diff interface to accept a "sha1_valid" flag whenever we accept a sha1, and to use that flag when creating a filespec. In some cases, this means passing the flag through several layers, making the code change larger than would be desirable. One alternative would be to simply die() upon seeing corrupted trees with null sha1s. However, this fix more directly addresses the problem (while bogus sha1s in a tree are probably a bad thing, it is really the sentinel confusion sending us down the wrong code path that is what makes it devastating). And it means that git is more capable of examining and debugging these corrupted trees. For example, you can still "diff --raw" such a tree to find out when the bogus entry was introduced; you just cannot do a "--patch" diff (just as you could not with any other corrupted tree, as we do not have any content to diff). 
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-07-28 17:03:01 +02:00			`fill_filespec(rename_dst[first].two, two->sha1, two->sha1_valid, two->mode);`
			`rename_dst[first].pair = NULL;`
			`return &(rename_dst[first]);`
			`}`
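The lookup-or-insert pattern used by locate_rename_dst() above (binary search over a sorted array, then memmove to open a slot at the insertion point) can be tried in isolation. The following is a minimal sketch, not git's code. `struct path_table` and `path_table_locate()` are hypothetical names, plain `malloc`/`realloc` stand in for git's `xrealloc()` and `alloc_filespec()`, and the growth formula is only assumed to resemble `alloc_nr()`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the rename_dst table: a sorted array of paths. */
struct path_table {
	char **paths;	/* kept sorted in strcmp() order */
	int nr, alloc;
};

/*
 * Return the index of `path` in the table. If it is missing and
 * insert_ok is nonzero, insert a copy at the sorted position and
 * return that index. Returns -1 if the path is missing and insert_ok
 * is zero, or if allocation fails.
 */
static int path_table_locate(struct path_table *t, const char *path,
			     int insert_ok)
{
	int first, last;
	char *copy;

	assert(t && path);

	/* Binary search over the half-open range [first, last). */
	first = 0;
	last = t->nr;
	while (last > first) {
		int next = first + (last - first) / 2;
		int cmp = strcmp(path, t->paths[next]);
		if (!cmp)
			return next;
		if (cmp < 0)
			last = next;
		else
			first = next + 1;
	}

	/* Not found; "first" is where the entry would have to go. */
	if (!insert_ok)
		return -1;

	/* Grow the array when it is full (assumed growth policy). */
	if (t->alloc <= t->nr) {
		int new_alloc = (t->alloc + 16) * 3 / 2;
		char **grown = realloc(t->paths,
				       new_alloc * sizeof(*t->paths));
		if (!grown)
			return -1;
		t->paths = grown;
		t->alloc = new_alloc;
	}

	copy = malloc(strlen(path) + 1);
	if (!copy)
		return -1;
	strcpy(copy, path);

	/* Shift the tail up by one slot, then fill the gap. */
	memmove(t->paths + first + 1, t->paths + first,
		(t->nr - first) * sizeof(*t->paths));
	t->paths[first] = copy;
	t->nr++;
	return first;
}
```

Starting from a zero-initialized `struct path_table`, inserting "b", "a" and "c" in that order leaves the array sorted as "a", "b", "c", and a later lookup with `insert_ok` set to 0 finds "b" without modifying the table.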

[PATCH] Fix the way diffcore-rename records unremoved source. Earier version of diffcore-rename used to keep unmodified filepair in its output so that the last stage of the processing that tells renames from copies can make all of rename/copy to copies. However this had a bad interaction with other diffcore filters that wanted to run after diffcore-rename, in that such unmodified filepair must be retained for proper distinction between renames and copies to happen. This patch fixes the problem by changing the way diffcore-rename records the information needed to distinguish "all are copies" case and "the last one is a rename" case. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-28 00:55:55 +02:00			`/* Table of rename/copy src files */`
			`static struct diff_rename_src {`
diffcore-rename: record filepair for rename src This will allow us to later skip unmodified entries added due to "-C -C". We might also want to do something similar to rename_dst side, but that would only be for the sake of symmetry and not necessary for this series. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:05 +01:00			`struct diff_filepair *p;`
diffcore-rename: fix merging back a broken pair. When a broken pair is matched up by rename detector to be merged back, we do not want to say it is "dissimilar" with the similarity index. The output should just say they were changed, taking the break score left by the earlier diffcore-break run if any. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-04-09 05:17:46 +02:00			`unsigned short score; /* to remember the break score */`
			`} *rename_src;`
			`static int rename_src_nr, rename_src_alloc;`
			`static struct diff_rename_src *register_rename_src(struct diff_filepair *p)`
			`{`
			`int first, last;`
			`struct diff_filespec *one = p->one;`
			`unsigned short score = p->score;`
			`first = 0;`
			`last = rename_src_nr;`
			`while (last > first) {`
			`int next = (last + first) >> 1;`
			`struct diff_rename_src *src = &(rename_src[next]);`
			`int cmp = strcmp(one->path, src->p->one->path);`
			`if (!cmp)`
			`return src;`
			`if (cmp < 0) {`
			`last = next;`
			`continue;`
			`}`
			`first = next+1;`
			`}`
			`/* insert to make it at "first" */`
			`if (rename_src_alloc <= rename_src_nr) {`
			`rename_src_alloc = alloc_nr(rename_src_alloc);`
			`rename_src = xrealloc(rename_src,`
			`rename_src_alloc * sizeof(*rename_src));`
			`}`
			`rename_src_nr++;`
			`if (first < rename_src_nr)`
			`memmove(rename_src + first + 1, rename_src + first,`
			`(rename_src_nr - first - 1) * sizeof(*rename_src));`
diffcore-rename: record filepair for rename src This will allow us to later skip unmodified entries added due to "-C -C". We might also want to do something similar to rename_dst side, but that would only be for the sake of symmetry and not necessary for this series. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:05 +01:00			`rename_src[first].p = p;`
diffcore-rename: fix merging back a broken pair. When a broken pair is matched up by rename detector to be merged back, we do not want to say it is "dissimilar" with the similarity index. The output should just say they were changed, taking the break score left by the earlier diffcore-break run if any. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-04-09 05:17:46 +02:00			`rename_src[first].score = score;`
[PATCH] Redo rename/copy detection logic. Earlier implementation had a major screw-up in the memory management area. Rename/copy logic sometimes borrowed a pointer to a structure without any provision for downstream to determine which pointer is shared and which is not. This resulted in the later clean-up code to sometimes double free such structure, resulting in a segfault. This made -M and -C useless. Another problem the earlier implementation had was that it reordered the patches, and forced the logic to differentiate renames and copies to depend on that particular order. This problem was fixed by teaching rename/copy detection logic not to do any reordering, and rename-copy differentiator not to depend on the order of the patches. The diffs will leave rename/copy detector in the same destination path order as the patch that was fed into it. Some test vectors have been reordered to accommodate this change. It also adds a sanity check logic to the human-readable diff-raw output to detect paths with embedded TAB and LF characters, which cannot be expressed with that format. This idea came up during a discussion with Chris Wedgwood. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 10:10:48 +02:00			`return &(rename_src[first]);`
[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`}`
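
The insertion step above (grow the array when full, shift the tail with `memmove`, then write into the freed slot) can be sketched in isolation. This is a minimal illustration, not the code from this file: the `items` array of `int`, the `sorted_insert` name, the plain `realloc` call, and the growth rule are all assumptions made for the example, and the binary search is included only so that `first` has a defined value.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the rename_src table: a sorted array of ints. */
static int *items;
static size_t items_nr, items_alloc;

/* Insert key while keeping items[] sorted; returns the slot it landed in. */
static size_t sorted_insert(int key)
{
	size_t first = 0, last = items_nr;

	/* Binary search for the first slot whose value is >= key. */
	while (first < last) {
		size_t mid = first + (last - first) / 2;
		if (items[mid] < key)
			first = mid + 1;
		else
			last = mid;
	}

	/* Grow the array when it is full (this growth rule is an assumption). */
	if (items_alloc <= items_nr) {
		size_t new_alloc = (items_alloc + 16) * 3 / 2;
		int *grown = realloc(items, new_alloc * sizeof(*items));
		if (!grown) {
			perror("realloc");
			exit(1);
		}
		items = grown;
		items_alloc = new_alloc;
	}
	assert(items_nr < items_alloc);	/* room for one more element */

	/* Shift the tail up by one slot, then fill the gap at "first". */
	memmove(items + first + 1, items + first,
		(items_nr - first) * sizeof(*items));
	items[first] = key;
	items_nr++;
	return first;
}
```

Calling `sorted_insert` with 5, 1, 4, 2, 3 in that order leaves `items` holding 1 through 5 in ascending order. The sketch bumps the count after the move, while the listing above increments `rename_src_nr` first and therefore moves `rename_src_nr - first - 1` elements; both shift the same number of entries.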

diffcore-rename: favour identical basenames When there are several candidates for a rename source, and one of them has an identical basename to the rename target, take that one. Noticed by Govind Salinas, posted by Shawn O. Pearce, partial patch by Linus Torvalds. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-21 13:52:11 +02:00			`static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)`
			`{`
			`int src_len = strlen(src->path), dst_len = strlen(dst->path);`
			`while (src_len && dst_len) {`
			`char c1 = src->path[--src_len];`
			`char c2 = dst->path[--dst_len];`
			`if (c1 != c2)`
			`return 0;`
			`if (c1 == '/')`
			`return 1;`
			`}`
			`return (!src_len || src->path[src_len - 1] == '/') &&`
			`(!dst_len || dst->path[dst_len - 1] == '/');`
			`}`
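
The backwards walk in `basename_same` can be tried on plain strings. The following is a sketch under assumptions: it takes `const char *` paths instead of `struct diff_filespec`, and the name `same_basename` is made up for the example. The comparison logic follows the listing above.

```c
#include <assert.h>
#include <string.h>

/* Return 1 when both paths end in the same final component, else 0. */
static int same_basename(const char *a, const char *b)
{
	size_t a_len, b_len;

	assert(a && b);
	a_len = strlen(a);
	b_len = strlen(b);

	/* Walk both strings backwards, one byte at a time. */
	while (a_len && b_len) {
		char c1 = a[--a_len];
		char c2 = b[--b_len];
		if (c1 != c2)
			return 0;
		if (c1 == '/')
			return 1;	/* both hit a separator at the same time */
	}
	/* One string ran out: the other must sit at a component boundary. */
	return (!a_len || a[a_len - 1] == '/') &&
	       (!b_len || b[b_len - 1] == '/');
}
```

With this sketch, `same_basename("src/diff.c", "lib/diff.c")` and `same_basename("diff.c", "lib/diff.c")` both return 1, while `same_basename("xdiff.c", "lib/diff.c")` returns 0 because the final check requires the longer path to be at a `/` boundary rather than in the middle of a name.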

[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`struct diff_score {`
[PATCH] Redo rename/copy detection logic. Earlier implementation had a major screw-up in the memory management area. Rename/copy logic sometimes borrowed a pointer to a structure without any provision for downstream to determine which pointer is shared and which is not. This resulted in the later clean-up code to sometimes double free such structure, resulting in a segfault. This made -M and -C useless. Another problem the earlier implementation had was that it reordered the patches, and forced the logic to differentiate renames and copies to depend on that particular order. This problem was fixed by teaching rename/copy detection logic not to do any reordering, and rename-copy differentiator not to depend on the order of the patches. The diffs will leave rename/copy detector in the same destination path order as the patch that was fed into it. Some test vectors have been reordered to accommodate this change. It also adds a sanity check logic to the human-readable diff-raw output to detect paths with embedded TAB and LF characters, which cannot be expressed with that format. This idea came up during a discussion with Chris Wedgwood. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 10:10:48 +02:00			`int src; /* index in rename_src */`
			`int dst; /* index in rename_dst */`
Optimize rename detection for a huge diff When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" that record how similar each of the pair is, and picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage that are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one, it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and continue on to match itself with another file in the preimage with a lessor match. This matters to a change introduced between 1.5.3.X series and 1.5.4-rc, that lets the code to favor unused matches first and then falls back to using already used matches. This instead allocates and keeps only a handful rename source candidates per new files in the postimage. I.e. it makes the memory requirement from O(N x M) to O(M). For each dst, we compute similarlity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep handful best src candidates for each dst. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2008-01-30 05:54:56 +01:00			`unsigned short score;`
			`short name_score;`
[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`};`

			`static int estimate_similarity(struct diff_filespec *src,`
			`struct diff_filespec *dst,`
			`int minimum_score)`
			`{`
			`/* src points at a file that existed in the original tree (or`
			`* optionally a file in the destination tree) and dst points`
			`* at a newly created file. They may be quite similar, in which`
			`* case we want to say src is renamed to dst or src is copied into`
			`* dst, and then some edit has been applied to dst.`
			`*`
			`* Compare them and return how similar they are, representing`
[PATCH] Optimize diff-tree -[CM] --stdin This attempts to optimize "diff-tree -[CM] --stdin", which compares successible tree pairs. This optimization does not make much sense for other commands in the diff-* brothers. When reading from --stdin and using rename/copy detection, the patch makes diff-tree to read the current index file first. This is done to reuse the optimization used by diff-cache in the non-cached case. Similarity estimator can avoid expanding a blob if the index says what is in the work tree has an exact copy of that blob already expanded. Another optimization the patch makes is to check only file sizes first to terminate similarity estimation early. In order for this to work, it needs a way to tell the size of the blob without expanding it. Since an obvious way of doing it, which is to keep all the blobs previously used in the memory, is too costly, it does so by keeping the filesize for each object it has already seen in memory. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-28 00:56:38 +02:00			`* the score as an integer between 0 and MAX_SCORE.`
			`*`
			`* When there is an exact match, it is considered a better`
			`* match than anything else; the destination does not even`
			`* call into this function in that case.`
[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`*/`
Fix up diffcore-rename scoring The "score" calculation for diffcore-rename was totally broken. It scaled "score" as score = src_copied * MAX_SCORE / dst->size; which means that you got a 100% similarity score even if src and dest were different, if just every byte of dst was copied from src, even if source was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ the remaining 15%). That's clearly bogus. We should do the score calculation relative not to the destination size, but to the max size of the two. This seems to fix it. Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 07:26:34 +01:00			`unsigned long max_size, delta_size, base_size, src_copied, literal_added;`
[PATCH] Use enhanced diff_delta() in the similarity estimator. The diff_delta() interface was extended to reject generating too big a delta while we were working on the packed GIT archive format. Take advantage of that when generating delta in the similarity estimator used in diffcore-rename.c Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-06-29 01:58:27 +02:00			`unsigned long delta_limit;`
[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`int score;`

[PATCH] Be careful with symlinks when detecting renames and copies. Earlier round was not treating symbolic links carefully enough, and would have produced diff output that renamed/copied then edited the contents of a symbolic link, which made no practical sense. Change it to detect only pure renames. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-23 06:24:49 +02:00			`/* We deal only with regular files. Symlink renames are handled`
			`* only when they are exact matches --- in other words, no edits`
			`* after renaming.`
			`*/`
			`if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))`
			`return 0;`

Fix ugly magic special case in exact rename detection For historical reasons, the exact rename detection had populated the filespecs for the entries it compared, and the rest of the similarity analysis depended on that. I hadn't even bothered to debug why that was the case when I re-did the rename detection, I just made the new one have the same broken behaviour, with a note about this special case. This fixes that fixme. The reason the exact rename detector needed to fill in the file sizes of the files it checked was that the _inexact_ rename detector was broken, and started comparing file sizes before it filled them in. Fixing that allows the exact phase to do the sane thing of never even caring (since all it cares about is really just the SHA1 itself, not the size nor the contents). It turns out that this also indirectly fixes a bug: trying to populate all the filespecs will run out of virtual memory if there is tons and tons of possible rename options. The fuzzy similarity analysis does the right thing in this regard, and free's the blob info after it has generated the hash tables, so the special case code caused more trouble than just some extra illogical code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-27 01:51:28 +02:00			`/*`
			`* Need to check that source and destination sizes are`
			`* filled in before comparing them.`
			`*`
			`* If we already have "cnt_data" filled in, we know it's`
			`* all good (avoid checking the size for zero, as that`
			`* is a possible size - we really should have a flag to`
			`* say whether the size is valid or not!)`
			`*/`
Rename detection: Avoid repeated filespec population In diffcore_rename, we assume that the blob contents in the filespec aren't required anymore after estimate_similarity has been called and thus we free it. But estimate_similarity might return early when the file sizes differ too much. In that case, cnt_data is never set and the next call to estimate_similarity will populate the filespec again, eventually rereading the same blob over and over again. To fix that, we first get the blob sizes and only when the blob contents are actually required, and when cnt_data will be set, the full filespec is populated, once. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-01-20 16:59:57 +01:00			`if (!src->cnt_data && diff_populate_filespec(src, 1))`
Fix ugly magic special case in exact rename detection For historical reasons, the exact rename detection had populated the filespecs for the entries it compared, and the rest of the similarity analysis depended on that. I hadn't even bothered to debug why that was the case when I re-did the rename detection, I just made the new one have the same broken behaviour, with a note about this special case. This fixes that fixme. The reason the exact rename detector needed to fill in the file sizes of the files it checked was that the _inexact_ rename detector was broken, and started comparing file sizes before it filled them in. Fixing that allows the exact phase to do the sane thing of never even caring (since all it cares about is really just the SHA1 itself, not the size nor the contents). It turns out that this also indirectly fixes a bug: trying to populate all the filespecs will run out of virtual memory if there is tons and tons of possible rename options. The fuzzy similarity analysis does the right thing in this regard, and free's the blob info after it has generated the hash tables, so the special case code caused more trouble than just some extra illogical code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-27 01:51:28 +02:00			`return 0;`
Rename detection: Avoid repeated filespec population In diffcore_rename, we assume that the blob contents in the filespec aren't required anymore after estimate_similarity has been called and thus we free it. But estimate_similarity might return early when the file sizes differ too much. In that case, cnt_data is never set and the next call to estimate_similarity will populate the filespec again, eventually rereading the same blob over and over again. To fix that, we first get the blob sizes and only when the blob contents are actually required, and when cnt_data will be set, the full filespec is populated, once. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-01-20 16:59:57 +01:00			`if (!dst->cnt_data && diff_populate_filespec(dst, 1))`
Fix ugly magic special case in exact rename detection For historical reasons, the exact rename detection had populated the filespecs for the entries it compared, and the rest of the similarity analysis depended on that. I hadn't even bothered to debug why that was the case when I re-did the rename detection, I just made the new one have the same broken behaviour, with a note about this special case. This fixes that fixme. The reason the exact rename detector needed to fill in the file sizes of the files it checked was that the _inexact_ rename detector was broken, and started comparing file sizes before it filled them in. Fixing that allows the exact phase to do the sane thing of never even caring (since all it cares about is really just the SHA1 itself, not the size nor the contents). It turns out that this also indirectly fixes a bug: trying to populate all the filespecs will run out of virtual memory if there is tons and tons of possible rename options. The fuzzy similarity analysis does the right thing in this regard, and free's the blob info after it has generated the hash tables, so the special case code caused more trouble than just some extra illogical code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-27 01:51:28 +02:00			`return 0;`

Fix up diffcore-rename scoring The "score" calculation for diffcore-rename was totally broken. It scaled "score" as score = src_copied * MAX_SCORE / dst->size; which means that you got a 100% similarity score even if src and dest were different, if just every byte of dst was copied from src, even if source was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ the remaining 15%). That's clearly bogus. We should do the score calculation relative not to the destination size, but to the max size of the two. This seems to fix it. Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 07:26:34 +01:00			`max_size = ((src->size > dst->size) ? src->size : dst->size);`
[PATCH] Tweak diffcore-rename heuristics. The heuristics so far was to compare file size change and xdelta size against the average of file size before and after the change. This patch uses the smaller of pre- and post- change file size instead. It also makes a very small performance fix. I didn't measure it; I do not expect it to make any practical difference, but while scanning an already sorted list, breaking out in the middle is the right thing. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-22 00:55:18 +02:00			`base_size = ((src->size < dst->size) ? src->size : dst->size);`
Fix up diffcore-rename scoring The "score" calculation for diffcore-rename was totally broken. It scaled "score" as score = src_copied * MAX_SCORE / dst->size; which means that you got a 100% similarity score even if src and dest were different, if just every byte of dst was copied from src, even if source was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ the remaining 15%). That's clearly bogus. We should do the score calculation relative not to the destination size, but to the max size of the two. This seems to fix it. Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 07:26:34 +01:00			`delta_size = max_size - base_size;`
[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00
[PATCH] Tweak diffcore-rename heuristics. The heuristics so far was to compare file size change and xdelta size against the average of file size before and after the change. This patch uses the smaller of pre- and post- change file size instead. It also makes a very small performance fix. I didn't measure it; I do not expect it to make any practical difference, but while scanning an already sorted list, breaking out in the middle is the right thing. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-22 00:55:18 +02:00			`/* We would not consider edits that change the file size so`
			`* drastically. delta_size must be smaller than`
[PATCH] Fix tweak in similarity estimator. There was a screwy math bug in the estimator that confused what -C1 meant and what -C9 meant, only in one of the early "cheap" check, which resulted in quite confusing behaviour. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-22 10:31:28 +02:00			`* (MAX_SCORE-minimum_score)/MAX_SCORE * max(src->size, dst->size).`
[PATCH] Optimize diff-tree -[CM] --stdin This attempts to optimize "diff-tree -[CM] --stdin", which compares successible tree pairs. This optimization does not make much sense for other commands in the diff-* brothers. When reading from --stdin and using rename/copy detection, the patch makes diff-tree to read the current index file first. This is done to reuse the optimization used by diff-cache in the non-cached case. Similarity estimator can avoid expanding a blob if the index says what is in the work tree has an exact copy of that blob already expanded. Another optimization the patch makes is to check only file sizes first to terminate similarity estimation early. In order for this to work, it needs a way to tell the size of the blob without expanding it. Since an obvious way of doing it, which is to keep all the blobs previously used in the memory, is too costly, it does so by keeping the filesize for each object it has already seen in memory. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-28 00:56:38 +02:00			`*`
[PATCH] Tweak diffcore-rename heuristics. The heuristics so far was to compare file size change and xdelta size against the average of file size before and after the change. This patch uses the smaller of pre- and post- change file size instead. It also makes a very small performance fix. I didn't measure it; I do not expect it to make any practical difference, but while scanning an already sorted list, breaking out in the middle is the right thing. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-22 00:55:18 +02:00			`* Note that base_size == 0 case is handled here already`
			`* and the final score computation below would not have a`
			`* divide-by-zero issue.`
[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`*/`
diffcore-rename: improve estimate_similarity() heuristics The logic to quickly dismiss potential rename pairs was broken. It would too eagerly dismiss possible renames when all of the difference was due to pure new data (or deleted data). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-19 05:12:06 +01:00			`if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)`
[PATCH] Diff overhaul, adding half of copy detection. This introduces the diff-core, the layer between the diff-tree family and the external diff interface engine. The calls to the interface diff-tree family uses (diff_change and diff_addremove) have not changed and will not change. The purpose of the diff-core layer is to provide an infrastructure to transform the set of differences sent from the applications, before sending them to the external diff interface. The recently introduced rename detection code has been rewritten to use the diff-core facility. When applications send in separate creates and deletes, matching ones are transformed into a single rename-and-edit diff, and sent out to the external diff interface as such. This patch also enhances the rename detection code further to be able to detect copies. Currently this happens only as long as copy sources appear as part of the modified files, but there already is enough provision for callers to report unmodified files to diff-core, so that they can be also used as copy source candidates. Extending the callers this way will be done in a separate patch. Please see and marvel at how well this works by trying out the newly added t/t4003-diff-rename-1.sh test script. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:39:09 +02:00			`return 0;`

Rename detection: Avoid repeated filespec population In diffcore_rename, we assume that the blob contents in the filespec aren't required anymore after estimate_similarity has been called and thus we free it. But estimate_similarity might return early when the file sizes differ too much. In that case, cnt_data is never set and the next call to estimate_similarity will populate the filespec again, eventually rereading the same blob over and over again. To fix that, we first get the blob sizes and only when the blob contents are actually required, and when cnt_data will be set, the full filespec is populated, once. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-01-20 16:59:57 +01:00			`if (!src->cnt_data && diff_populate_filespec(src, 0))`
			`return 0;`
			`if (!dst->cnt_data && diff_populate_filespec(dst, 0))`
			`return 0;`

Cast 64 bit off_t to 32 bit size_t Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4. This implies that we are able to access and work on files whose maximum length is around 2^63-1 bytes, but we can only malloc or mmap somewhat less than 2^32-1 bytes of memory. On such a system an implicit conversion of off_t to size_t can cause the size_t to wrap, resulting in unexpected and exciting behavior. Right now we are working around all gcc warnings generated by the -Wshorten-64-to-32 option by passing the off_t through xsize_t(). In the future we should make xsize_t on such problematic platforms detect the wrapping and die if such a file is accessed. Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Signed-off-by: Junio C Hamano <junkio@cox.net> 2007-03-07 02:44:37 +01:00			`delta_limit = (unsigned long)`
			`(base_size * (MAX_SCORE-minimum_score) / MAX_SCORE);`
diffcore_count_changes: pass diffcore_filespec We may want to use richer information on the data we are dealing with in this function, so instead of passing a buffer address and length, just pass the diffcore_filespec structure. Existing callers always call this function with parameters taken from a filespec anyway, so there is no functionality changes. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-29 07:54:37 +02:00			`if (diffcore_count_changes(src, dst,`
diffcore-rename: somewhat optimized. This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-12 12:22:10 +01:00			`&src->cnt_data, &dst->cnt_data,`
diffcore-rename: split out the delta counting code. This is to rework diffcore break/rename/copy detection code so that it is not affected when deltifier code gets improved. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-01 01:01:36 +01:00			`delta_limit,`
			`&src_copied, &literal_added))`
[PATCH] Update rename/copy similarity estimator. The second round similarity estimator simply used the size of the xdelta itself to estimate the extent of damage. This patch keeps that logic to detect big insertions to terminate the check early, but otherwise looks at the generated delta in order to estimate the extent of edit more accurately. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 21:09:32 +02:00			`return 0;`
[PATCH] Tweak count-delta interface Make it return copied source and insertion separately, so that later implementation of heuristics can use them more flexibly. This does not change the heuristics implemented in diffcore-rename nor diffcore-break in any way. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-06-03 10:36:03 +02:00
diffcore-rename: similarity estimator fix. The "similarity" logic was giving added material way too much negative weight. What we wanted to see was how similar the post-change image was compared to the pre-change image, so the natural definition of similarity is how much common things are there, relative to the post-change image's size. This simplifies things a lot. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-03 07:11:25 +01:00			`/* How similar are they?`
			`* what percentage of the material in dst is from the source?`
			`*/`
Fix up diffcore-rename scoring The "score" calculation for diffcore-rename was totally broken. It scaled "score" as score = src_copied * MAX_SCORE / dst->size; which means that you got a 100% similarity score even if src and dest were different, if just every byte of dst was copied from src, even if source was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ the remaining 15%). That's clearly bogus. We should do the score calculation relative not to the destination size, but to the max size of the two. This seems to fix it. Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-03-13 07:26:34 +01:00			`if (!dst->size)`
			`score = 0; /* should not happen */`
diffcore-rename: don't change similarity index based on basename equality This implements a suggestion from Johannes. It uses a separate field in struct diff_score to keep the result of the file name comparison in the rename detection logic. This reverts the value of the similarity index to be a function of file contents, only, and basename comparison is only used to decide between files with equal amounts of content changes. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-25 00:23:28 +02:00			`else`
			`score = (int)(src_copied * MAX_SCORE / max_size);`
			`return score;`
			`}`
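The scoring arithmetic above can be tried in isolation. The following is a minimal sketch, not the code from this file: the `sketch_` names are hypothetical, `SKETCH_MAX_SCORE` is an assumed stand-in for `MAX_SCORE` (the real constant is defined in diffcore.h), and `delta_size` is assumed to be the difference between the two file sizes. It shows the early dismissal test and the score taken relative to the larger of the two sizes.

```c
/* Assumed value for illustration only; the real MAX_SCORE lives in diffcore.h. */
#define SKETCH_MAX_SCORE 60000.0

/* Hypothetical helper: the larger of two sizes. */
static unsigned long sketch_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Early dismissal, mirroring the comparison shown in the listing:
 * returns 1 when the size difference alone already rules the pair out.
 * delta_size is assumed here to be (larger size - smaller size).
 */
static int sketch_too_different(unsigned long src_size,
				unsigned long dst_size,
				int minimum_score)
{
	unsigned long max_size = sketch_max(src_size, dst_size);
	unsigned long min_size = src_size < dst_size ? src_size : dst_size;
	unsigned long delta_size = max_size - min_size;

	return max_size * (SKETCH_MAX_SCORE - minimum_score) <
	       delta_size * SKETCH_MAX_SCORE;
}

/*
 * Score relative to the larger file, so that deleting part of the
 * source lowers the score even if every byte of dst was copied.
 */
static int sketch_score(unsigned long src_copied,
			unsigned long src_size,
			unsigned long dst_size)
{
	unsigned long max_size = sketch_max(src_size, dst_size);

	if (!max_size)
		return 0; /* guard added for the sketch; avoids dividing by zero */
	return (int)(src_copied * SKETCH_MAX_SCORE / max_size);
}
```

With these assumptions, a 100-byte source reduced to an 85-byte destination made entirely of copied bytes scores 51000 out of 60000 (85%) instead of a full score, which is the case the "Fix up diffcore-rename scoring" commit message describes.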

Plug diff leaks. It is a bit embarrassing that it took this long for a fix since the problem was first reported on Aug 13th. Message-ID: <87y876gl1r.wl@mail2.atmark-techno.com> From: Yasushi SHOJI <yashi@atmark-techno.com> Newsgroups: gmane.comp.version-control.git Subject: [patch] possible memory leak in diff.c::diff_free_filepair() Date: Sat, 13 Aug 2005 19:58:56 +0900 This time I used valgrind to make sure that it does not overeagerly discard memory that is still being used. Signed-off-by: Junio C Hamano <junkio@cox.net> 2005-09-16 01:13:43 +02:00			`static void record_rename_pair(int dst_index, int src_index, int score)`
			`{`
Ref-count the filespecs used by diffcore Rather than copy the filespecs when introducing new versions of them (for rename or copy detection), use a refcount and increment the count when reusing the diff_filespec. This avoids unnecessary allocations, but the real reason behind this is a future enhancement: we will want to track shared data across the copy/rename detection. In order to efficiently notice when a filespec is used by a rename, the rename machinery wants to keep track of a rename usage count which is shared across all different users of the filespec. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:19:10 +02:00			`struct diff_filespec *src, *dst;`
[PATCH] Redo rename/copy detection logic. Earlier implementation had a major screw-up in the memory management area. Rename/copy logic sometimes borrowed a pointer to a structure without any provision for downstream to determine which pointer is shared and which is not. This resulted in the later clean-up code to sometimes double free such structure, resulting in a segfault. This made -M and -C useless. Another problem the earlier implementation had was that it reordered the patches, and forced the logic to differentiate renames and copies to depend on that particular order. This problem was fixed by teaching rename/copy detection logic not to do any reordering, and rename-copy differentiator not to depend on the order of the patches. The diffs will leave rename/copy detector in the same destination path order as the patch that was fed into it. Some test vectors have been reordered to accommodate this change. It also adds a sanity check logic to the human-readable diff-raw output to detect paths with embedded TAB and LF characters, which cannot be expressed with that format. This idea came up during a discussion with Chris Wedgwood. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 10:10:48 +02:00			`struct diff_filepair *dp;`
[PATCH] Rename/copy detection fix. The rename/copy detection logic in earlier round was only good enough to show patch output and discussion on the mailing list about the diff-raw format updates revealed many problems with it. This patch fixes all the ones known to me, without making things I want to do later impossible, mostly related to patch reordering. (1) Earlier rename/copy detector determined which one is rename and which one is copy too early, which made it impossible to later introduce diffcore transformers to reorder patches. This patch fixes it by moving that logic to the very end of the processing. (2) Earlier output routine diff_flush() was pruning all the "no-change" entries indiscriminatingly. This was done due to my false assumption that one of the requirements in the diff-raw output was not to show such an entry (which resulted in my incorrect comment about "diff-helper never being able to be equivalent to built-in diff driver"). My special thanks go to Linus for correcting me about this. When we produce diff-raw output, for the downstream to be able to tell renames from copies, sometimes it _is_ necessary to output "no-change" entries, and this patch adds diffcore_prune() function for doing it. (3) Earlier diff_filepair structure was trying to be not too specific about rename/copy operations, but the purpose of the structure was to record one or two paths, which _was_ indeed about rename/copy. This patch discards xfrm_msg field which was trying to be generic for this wrong reason, and introduces a couple of fields (rename_score and rename_rank) that are explicitly specific to rename/copy logic. One thing to note is that the information in a single diff_filepair structure _still_ does not distinguish renames from copies, and it is deliberately so. This is to allow patches to be reordered in later stages. (4) This patch also adds some tests about diff-raw format output and makes sure that necessary "no-change" entries appear on the output. 
Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-23 06:26:09 +02:00
			`if (rename_dst[dst_index].pair)`
			`die("internal error: dst already matched.");`
diffcore-rename: record filepair for rename src This will allow us to later skip unmodified entries added due to "-C -C". We might also want to do something similar to rename_dst side, but that would only be for the sake of symmetry and not necessary for this series. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:05 +01:00			`src = rename_src[src_index].p->one;`
copy vs rename detection: avoid unnecessary O(n*m) loops The core rename detection had some rather stupid code to check if a pathname was used by a later modification or rename, which basically walked the whole pathname space for all renames for each rename, in order to tell whether it was a pure rename (no remaining users) or should be considered a copy (other users of the source file remaining). That's really silly, since we can just keep a count of users around, and replace all those complex and expensive loops with just testing that simple counter (but this all depends on the previous commit that shared the diff_filespec data structure by using a separate reference count). Note that the reference count is not the same as the rename count: they behave otherwise rather similarly, but the reference count is tied to the allocation (and decremented at de-allocation, so that when it turns zero we can get rid of the memory), while the rename count is tied to the renames and is decremented when we find a rename (so that when it turns zero we know that it was a rename, not a copy). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:20:56 +02:00			`src->rename_used++;`
			`src->count++;`
			`dst = rename_dst[dst_index].two;`
			`dst->count++;`
			`dp = diff_queue(NULL, src, dst);`
diff.c: do not use pathname comparison to tell renames The final output from diff used to compare pathnames between preimage and postimage to tell if the filepair is a rename/copy. By explicitly marking the filepair created by diffcore_rename(), the output routine, resolve_rename_copy(), does not have to do so anymore. This helps feeding a filepair that has different pathnames in one and two elements to the diff machinery (most notably, comparing two blobs). Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-08-03 21:01:01 +02:00			`dp->renamed_pair = 1;`
diffcore-rename: fix merging back a broken pair. When a broken pair is matched up by rename detector to be merged back, we do not want to say it is "dissimilar" with the similarity index. The output should just say they were changed, taking the break score left by the earlier diffcore-break run if any. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-04-09 05:17:46 +02:00			`if (!strcmp(src->path, dst->path))`
			`dp->score = rename_src[src_index].score;`
			`else`
			`dp->score = score;`
			`rename_dst[dst_index].pair = dp;`
			`}`
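The commit messages attached to `record_rename_pair()` distinguish two counters on a filespec: a reference count tied to allocation, and a rename usage count tied to how many destinations were matched to the source. The following is a minimal sketch of that distinction under stated assumptions: `struct sketch_filespec` and every `sketch_` function are hypothetical stand-ins, not the real `struct diff_filespec` API, and the copy-versus-rename rule is paraphrased from the "copy vs rename detection" commit message rather than taken from this file.

```c
#include <stdlib.h>

/* Hypothetical, pared-down stand-in for struct diff_filespec. */
struct sketch_filespec {
	int count;       /* reference count: tied to the allocation's lifetime */
	int rename_used; /* number of destinations currently using this source */
};

/* Allocate a filespec owned by a single holder. */
static struct sketch_filespec *sketch_alloc(void)
{
	struct sketch_filespec *spec = calloc(1, sizeof(*spec));

	if (spec)
		spec->count = 1;
	return spec;
}

/* A destination was matched to this source: bump both counters. */
static void sketch_use_as_source(struct sketch_filespec *spec)
{
	spec->rename_used++;
	spec->count++;
}

/*
 * Resolve one matched destination. While other users of the source
 * remain, the pair is treated as a copy (returns 1); the last user
 * makes it a rename (returns 0).
 */
static int sketch_resolves_as_copy(struct sketch_filespec *spec)
{
	return --spec->rename_used > 0;
}

/* Drop one reference; frees the spec and returns 1 on the last one. */
static int sketch_release(struct sketch_filespec *spec)
{
	if (--spec->count)
		return 0;
	free(spec);
	return 1;
}
```

Keeping the two counters separate means the copy-or-rename decision is a single counter test per pair, rather than a walk over every other pair to see whether the source path is still in use.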

			`/*`
			`* We sort the rename similarity matrix with the score, in descending`
[PATCH] Fix the way diffcore-rename records unremoved source. Earlier version of diffcore-rename used to keep unmodified filepair in its output so that the last stage of the processing that tells renames from copies can make all of rename/copy to copies. However this had a bad interaction with other diffcore filters that wanted to run after diffcore-rename, in that such unmodified filepair must be retained for proper distinction between renames and copies to happen. This patch fixes the problem by changing the way diffcore-rename records the information needed to distinguish "all are copies" case and "the last one is a rename" case. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-28 00:55:55 +02:00			`* order (the most similar first).`
			`*/`
			`static int score_compare(const void *a_, const void *b_)`
			`{`
			`const struct diff_score *a = a_, *b = b_;`
diffcore-rename: don't change similarity index based on basename equality This implements a suggestion from Johannes. It uses a separate field in struct diff_score to keep the result of the file name comparison in the rename detection logic. This reverts the value of the similarity index to be a function of file contents, only, and basename comparison is only used to decide between files with equal amounts of content changes. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-06-25 00:23:28 +02:00
Optimize rename detection for a huge diff When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" that record how similar each of the pair is, and picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage that are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one, it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and continue on to match itself with another file in the preimage with a lessor match. This matters to a change introduced between 1.5.3.X series and 1.5.4-rc, that lets the code to favor unused matches first and then falls back to using already used matches. This instead allocates and keeps only a handful rename source candidates per new files in the postimage. I.e. it makes the memory requirement from O(N x M) to O(M). For each dst, we compute similarlity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep handful best src candidates for each dst. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2008-01-30 05:54:56 +01:00			`/* sink the unused ones to the bottom */`
			`if (a->dst < 0)`
			`return (0 <= b->dst);`
			`else if (b->dst < 0)`
			`return -1;`

			`if (a->score == b->score)`
			`return b->name_score - a->name_score;`

			`return b->score - a->score;`
			`}`
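
To illustrate the ordering that `score_compare` imposes, here is a minimal, self-contained sketch. `struct demo_score`, `demo_score_compare` and `demo_sort_scores` are hypothetical stand-ins introduced only for this example; the struct carries just the fields the comparator reads, plus a `src` index so the result can be inspected. It assumes, as the comparator above suggests, that a negative `dst` marks an unused slot.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct diff_score. */
struct demo_score {
	int src;        /* index of the candidate source */
	int dst;        /* index of the destination, or -1 if the slot is unused */
	int score;      /* content similarity */
	int name_score; /* 1 if the basenames match, else 0 */
};

/* Same ordering rules as score_compare() above. */
static int demo_score_compare(const void *a_, const void *b_)
{
	const struct demo_score *a = a_, *b = b_;

	/* sink the unused ones to the bottom */
	if (a->dst < 0)
		return (0 <= b->dst);
	else if (b->dst < 0)
		return -1;

	/* equal content similarity: prefer the matching basename */
	if (a->score == b->score)
		return b->name_score - a->name_score;

	/* otherwise the most similar candidate comes first */
	return b->score - a->score;
}

/* Sort n candidates in place, best match first. */
void demo_sort_scores(struct demo_score *mx, size_t n)
{
	assert(mx != NULL || n == 0);
	qsort(mx, n, sizeof(*mx), demo_score_compare);
}
```

With one unused slot, one high-scoring candidate, and two candidates tied on `score`, the high score sorts first, the tie is broken in favour of the entry with `name_score` set, and the unused slot ends up last.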

			`struct file_similarity {`
			`int src_dst, index;`
			`struct diff_filespec *filespec;`
			`struct file_similarity *next;`
			`};`
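
The `next` pointer lets entries that land in the same hash bucket be chained together. The sketch below shows one way such a chain could be split into the separate source and destination lists that `find_identical_files()` takes as arguments. It is an assumption-laden illustration, not the file's own code: `struct demo_similarity`, `struct demo_filespec` and `demo_split_chain` are hypothetical names, and the convention that a negative `src_dst` marks a source is assumed here rather than shown in this excerpt.

```c
#include <assert.h>
#include <stddef.h>

struct demo_filespec; /* opaque placeholder for struct diff_filespec */

/* Hypothetical mirror of struct file_similarity. */
struct demo_similarity {
	int src_dst, index;
	struct demo_filespec *filespec;
	struct demo_similarity *next;
};

/*
 * Split one hash-bucket chain into a source list and a destination
 * list. Assumption: src_dst < 0 means "source", anything else means
 * "destination". Entries are pushed onto the front of each list, so
 * their relative order is reversed.
 */
void demo_split_chain(struct demo_similarity *p,
		      struct demo_similarity **src,
		      struct demo_similarity **dst)
{
	assert(src && dst);
	*src = NULL;
	*dst = NULL;
	while (p) {
		struct demo_similarity *entry = p;

		p = p->next;
		if (entry->src_dst < 0) {
			entry->next = *src;
			*src = entry;
		} else {
			entry->next = *dst;
			*dst = entry;
		}
	}
}
```

Pushing onto the list heads keeps the split linear in the chain length and needs no extra allocation, at the cost of reversing the order within each list.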

			`static int find_identical_files(struct file_similarity *src,`
for_each_hash: allow passing a 'void *data' pointer to callback For the find_exact_renames() function, this allows us to pass the diff_options structure pointer to the low-level routines. We will use that to distinguish between the "rename" and "copy" cases. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-19 04:55:19 +01:00			`struct file_similarity *dst,`
			`struct diff_options *options)`
			`{`
			`int renames = 0;`

			`/*`
			`* Walk over all the destinations ...`
			`*/`
			`do {`
Fix a pathological case in git detecting proper renames Kumar Gala had a case in the u-boot archive with multiple renames of files with identical contents, and git would turn those into multiple "copy" operations of one of the sources, and just deleting the other sources. This patch makes the git exact rename detection prefer to spread out the renames over the multiple sources, rather than do multiple copies of one source. NOTE! The changes are a bit larger than required, because I also renamed the variables named "one" and "two" to "target" and "source" respectively. That makes the logic easier to follow, especially as the "one" was illogically the target and not the soruce, for purely historical reasons (this piece of code used to traverse over sources and targets in the wrong order, and when we fixed that, we didn't fix the names back then. So I fixed them now). The important part of this change is just the trivial score calculations for when files have identical contents: /* Give higher scores to sources that haven't been used already */ score = !source->rename_used; score += basename_same(source, target); and when we have multiple choices we'll now pick the choice that gets the best rename score, rather than only looking at whether the basename matched. It's worth noting a few gotchas: - this scoring is currently only done for the "exact match" case. In particular, in Kumar's example, even after this patch, the inexact match case is still done as a copy+delete rather than as two renames: delete mode 100644 board/cds/mpc8555cds/u-boot.lds copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%) rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%) because apparently the "cds/mpc8541cds/u-boot.lds" copy looked a bit more similar to both end results. 
That said, I *suspect* we just have the exact same issue there - the similarity analysis just gave identical (or at least very _close_ to identical) similarity points, and we do not have any logic to prefer multiple renames over a copy/delete there. That is a separate patch. - When you have identical contents and identical basenames, the actual entry that is chosen is still picked fairly "at random" for the first one (but the subsequent ones will prefer entries that haven't already been used). It's not actually really random, in that it actually depends on the relative alphabetical order of the files (which in turn will have impacted the order that the entries got hashed!), so it gives consistent results that can be explained. But I wanted to point it out as an issue for when anybody actually does cross-renames. In Kumar's case the choice is the right one (and for a single normal directory rename it should always be, since the relative alphabetical sorting of the files will be identical), and we now get: rename board/{cds => freescale}/mpc8541cds/init.S (100%) rename board/{cds => freescale}/mpc8548cds/init.S (100%) which is the "expected" answer. However, it might still be better to change the pedantic "exact same basename" on/off choice into a more graduated "how similar are the pathnames" scoring situation, in order to be more likely to get the exact rename choice that people expect to see, rather than other alternatives that may technically be equally good, but are surprising to a human. It's also unclear whether we should consider "basenames are equal" or "have already used this as a source" to be more important. This gives them equal weight, but I suspect we might want to just multiple the "basenames are equal" weight by two, or something, to prefer equal basenames even if that causes a copy/delete pair. I dunno. 
Anyway, what I'm just saying in a really long-winded manner is that I think this is right as-is, but it's not the complete solution, and it may want some further tweaking in the future. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-11-29 22:30:13 +01:00			`struct diff_filespec *target = dst->filespec;`
			`struct file_similarity *p, *best;`
			`int i = 100, best_score = -1;`

			`/*`
			`* .. to find the best source match`
			`*/`
			`best = NULL;`
			`for (p = src; p; p = p->next) {`
			`int score;`
			`struct diff_filespec *source = p->filespec;`
Fix typos / spelling in comments Signed-off-by: Mike Ralphson <mike@abacus.co.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-04-17 20:13:30 +02:00			`/* False hash collision? */`
			`if (hashcmp(source->sha1, target->sha1))`
			`continue;`
			`/* Non-regular files? If so, the modes must match! */`
Fix a pathological case in git detecting proper renames Kumar Gala had a case in the u-boot archive with multiple renames of files with identical contents, and git would turn those into multiple "copy" operations of one of the sources, and just deleting the other sources. This patch makes the git exact rename detection prefer to spread out the renames over the multiple sources, rather than do multiple copies of one source. NOTE! The changes are a bit larger than required, because I also renamed the variables named "one" and "two" to "target" and "source" respectively. That makes the logic easier to follow, especially as the "one" was illogically the target and not the soruce, for purely historical reasons (this piece of code used to traverse over sources and targets in the wrong order, and when we fixed that, we didn't fix the names back then. So I fixed them now). The important part of this change is just the trivial score calculations for when files have identical contents: /* Give higher scores to sources that haven't been used already / score = !source->rename_used; score += basename_same(source, target); and when we have multiple choices we'll now pick the choice that gets the best rename score, rather than only looking at whether the basename matched. It's worth noting a few gotchas: - this scoring is currently only done for the "exact match" case. In particular, in Kumar's example, even after this patch, the inexact match case is still done as a copy+delete rather than as two renames: delete mode 100644 board/cds/mpc8555cds/u-boot.lds copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%) rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%) because apparently the "cds/mpc8541cds/u-boot.lds" copy looked a bit more similar to both end results. 
That said, I suspect* we just have the exact same issue there - the similarity analysis just gave identical (or at least very _close_ to identical) similarity points, and we do not have any logic to prefer multiple renames over a copy/delete there. That is a separate patch. - When you have identical contents and identical basenames, the actual entry that is chosen is still picked fairly "at random" for the first one (but the subsequent ones will prefer entries that haven't already been used). It's not actually really random, in that it actually depends on the relative alphabetical order of the files (which in turn will have impacted the order that the entries got hashed!), so it gives consistent results that can be explained. But I wanted to point it out as an issue for when anybody actually does cross-renames. In Kumar's case the choice is the right one (and for a single normal directory rename it should always be, since the relative alphabetical sorting of the files will be identical), and we now get: rename board/{cds => freescale}/mpc8541cds/init.S (100%) rename board/{cds => freescale}/mpc8548cds/init.S (100%) which is the "expected" answer. However, it might still be better to change the pedantic "exact same basename" on/off choice into a more graduated "how similar are the pathnames" scoring situation, in order to be more likely to get the exact rename choice that people expect to see, rather than other alternatives that may technically be equally good, but are surprising to a human. It's also unclear whether we should consider "basenames are equal" or "have already used this as a source" to be more important. This gives them equal weight, but I suspect we might want to just multiple the "basenames are equal" weight by two, or something, to prefer equal basenames even if that causes a copy/delete pair. I dunno. 
Anyway, what I'm just saying in a really long-winded manner is that I think this is right as-is, but it's not the complete solution, and it may want some further tweaking in the future. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-11-29 22:30:13 +01:00			`if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {`
			`if (source->mode != target->mode)`
			`continue;`
			`}`
			`/* Give higher scores to sources that haven't been used already */`
			`score = !source->rename_used;`
diffcore-rename: properly honor the difference between -M and -C We would allow rename detection to do copy detection even when asked purely for renames. That confuses users, but more importantly it can terminally confuse the recursive merge rename logic. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-19 05:10:32 +01:00			`if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)`
			`continue;`
			`score += basename_same(source, target);`
			`if (score > best_score) {`
			`best = p;`
			`best_score = score;`
			`if (score == 2)`
			`break;`
			`}`
			`/* Too many identical alternatives? Pick one */`
			`if (!--i)`
			`break;`
			`}`
			`if (best) {`
			`record_rename_pair(dst->index, best->index, MAX_SCORE);`
			`renames++;`
			`}`
			`} while ((dst = dst->next) != NULL);`
			`return renames;`
			`}`
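The scoring rule used for identical-content candidates can be illustrated in isolation. The following is a minimal sketch, not the code from this file: `struct candidate`, `base_name()` and `pick_best_source()` are hypothetical names introduced for the example, and the `-M` versus `-C` distinction (skipping already-used sources when only renames are requested) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a rename source candidate. */
struct candidate {
	const char *path;
	int rename_used;	/* nonzero once this source has been paired */
};

/* Return the part of a path after the last '/', or the whole path. */
static const char *base_name(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Pick the best source for `target_path` among `n` candidates that
 * have identical content.  One point for a source that has not been
 * used yet, one point for a matching basename; the first candidate
 * with the highest score wins.  Returns its index, or -1 if n is 0.
 */
static int pick_best_source(const struct candidate *src, size_t n,
			    const char *target_path)
{
	int best = -1, best_score = -1;
	size_t i;

	assert(target_path != NULL);
	for (i = 0; i < n; i++) {
		int score = !src[i].rename_used;
		score += !strcmp(base_name(src[i].path), base_name(target_path));
		if (score > best_score) {
			best = (int)i;
			best_score = score;
			if (score == 2)
				break;	/* cannot do better than 2 */
		}
	}
	return best;
}
```

Because both criteria carry equal weight, an unused source with a different basename ties with a used source that has the same basename, and the earlier candidate in the list wins the tie.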

			`static void free_similarity_list(struct file_similarity *p)`
			`{`
			`while (p) {`
			`struct file_similarity *entry = p;`
			`p = p->next;`
			`free(entry);`
			`}`
			`}`

for_each_hash: allow passing a 'void *data' pointer to callback For the find_exact_renames() function, this allows us to pass the diff_options structure pointer to the low-level routines. We will use that to distinguish between the "rename" and "copy" cases. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-19 04:55:19 +01:00			`static int find_same_files(void *ptr, void *data)`
			`{`
			`int ret;`
			`struct file_similarity *p = ptr;`
			`struct file_similarity *src = NULL, *dst = NULL;`
			`struct diff_options *options = data;`
			`/* Split the hash list up into sources and destinations */`
			`do {`
			`struct file_similarity *entry = p;`
			`p = p->next;`
			`if (entry->src_dst < 0) {`
			`entry->next = src;`
			`src = entry;`
			`} else {`
			`entry->next = dst;`
			`dst = entry;`
			`}`
			`} while (p);`

			`/*`
			`* If we have both sources and destinations, see if`
			`* we can match them up`
			`*/`
			`ret = (src && dst) ? find_identical_files(src, dst, options) : 0;`
			`/* Free the hashes and return the number of renames found */`
			`free_similarity_list(src);`
			`free_similarity_list(dst);`
			`return ret;`
			`}`

			`static unsigned int hash_filespec(struct diff_filespec *filespec)`
			`{`
			`unsigned int hash;`
			`if (!filespec->sha1_valid) {`
			`if (diff_populate_filespec(filespec, 0))`
			`return 0;`
			`hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);`
			`}`
			`memcpy(&hash, filespec->sha1, sizeof(hash));`
			`return hash;`
			`}`

			`static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)`
			`{`
			`void **pos;`
			`unsigned int hash;`
			`struct file_similarity *entry = xmalloc(sizeof(*entry));`

			`entry->src_dst = src_dst;`
			`entry->index = index;`
			`entry->filespec = filespec;`
			`entry->next = NULL;`

			`hash = hash_filespec(filespec);`
			`pos = insert_hash(hash, entry, table);`

			`/* We already had an entry there? */`
			`if (pos) {`
			`entry->next = *pos;`
			`*pos = entry;`
			`}`
			`}`

Split out "exact content match" phase of rename detection This makes the exact content match a separate function of its own. Partly to cut down a bit on the size of the diffcore_rename() function (which is too complex as it is), and partly because there are smarter ways to do this than an O(m*n) loop over it all, and that function should be rewritten to take that into account. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:17:55 +02:00			`/*`
			`* Find exact renames first.`
			`*`
			`* The first round matches up the up-to-date entries,`
			`* and then during the second round we try to match`
			`* cache-dirty entries as well.`
			`*/`
			`static int find_exact_renames(struct diff_options *options)`
			`{`
			`int i;`
			`struct hash_table file_table;`
			`init_hash(&file_table);`
Preallocate hash tables when the number of inserts are known in advance This avoids unnecessary re-allocations and reinsertions. On webkit.git (i.e. about 182k inserts to the name hash table), this reduces about 100ms out of 3s user time. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2013-03-17 04:28:06 +01:00			`preallocate_hash(&file_table, rename_src_nr + rename_dst_nr);`
			`for (i = 0; i < rename_src_nr; i++)`
diffcore-rename: record filepair for rename src This will allow us to later skip unmodified entries added due to "-C -C". We might also want to do something similar to rename_dst side, but that would only be for the sake of symmetry and not necessary for this series. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:05 +01:00			`insert_file_table(&file_table, -1, i, rename_src[i].p->one);`
			`for (i = 0; i < rename_dst_nr; i++)`
			`insert_file_table(&file_table, 1, i, rename_dst[i].two);`

			`/* Find the renames */`
			`i = for_each_hash(&file_table, find_same_files, options);`
			`/* .. and free the hash data structure */`
			`free_hash(&file_table);`

			`return i;`
			`}`
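The idea behind the hash-table pass, bucketing files by a content hash so that only files in the same bucket are ever compared, can be sketched on its own. This is an illustration under stated assumptions rather than the implementation above: `struct entry`, `NBUCKETS`, `bucket_insert()` and `count_exact_pairs()` are hypothetical, the table has a fixed size, and each source is paired at most once (rename-only behaviour, with no scoring and no copies).

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 64	/* hypothetical fixed table size for the sketch */

/* Hypothetical entry: one file on either side of the diff. */
struct entry {
	unsigned int hash;	/* stand-in for the leading bytes of the blob ID */
	int is_src;		/* 1 for a rename source, 0 for a destination */
	int used;		/* set once a source has been paired */
	struct entry *next;	/* chain of entries in the same bucket */
};

/* Push `e` on the front of its bucket's chain: O(1) per file. */
static void bucket_insert(struct entry *table[NBUCKETS], struct entry *e)
{
	struct entry **head;

	assert(e != NULL);
	head = &table[e->hash % NBUCKETS];
	e->next = *head;
	*head = e;
}

/*
 * Pair each destination with an unused source carrying the same hash.
 * Only entries that share a bucket are compared, so the work stays
 * close to linear as long as the chains remain short.
 */
static int count_exact_pairs(struct entry *table[NBUCKETS])
{
	int pairs = 0;
	size_t b;

	for (b = 0; b < NBUCKETS; b++) {
		struct entry *dst, *src;

		for (dst = table[b]; dst; dst = dst->next) {
			if (dst->is_src)
				continue;
			for (src = table[b]; src; src = src->next) {
				if (src->is_src && !src->used &&
				    src->hash == dst->hash) {
					src->used = 1;
					pairs++;
					break;
				}
			}
		}
	}
	return pairs;
}
```

Entries that land in the same bucket with different hashes are skipped by the equality test; the code in this file applies the same principle by comparing the full object names rather than only the truncated hash.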

Optimize rename detection for a huge diff When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" that record how similar each of the pair is, and picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one, it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and continue on to match itself with another file in the preimage with a lesser match. This matters to a change introduced between the 1.5.3.X series and 1.5.4-rc, which lets the code favor unused matches first and then fall back to using already used matches. This instead allocates and keeps only a handful of rename source candidates per new file in the postimage. I.e. it reduces the memory requirement from O(N x M) to O(M). For each dst, we compute similarity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep only a handful of the best src candidates for each dst. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2008-01-30 05:54:56 +01:00			`#define NUM_CANDIDATE_PER_DST 4`
			`static void record_if_better(struct diff_score m[], struct diff_score *o)`
			`{`
			`int i, worst;`

			`/* find the worst one */`
			`worst = 0;`
			`for (i = 1; i < NUM_CANDIDATE_PER_DST; i++)`
			`if (score_compare(&m[i], &m[worst]) > 0)`
			`worst = i;`

			`/* is it better than the worst one? */`
			`if (score_compare(&m[worst], o) > 0)`
			`m[worst] = *o;`
			`}`
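Keeping only a fixed handful of best candidates per destination can be shown with a simplified score record. The sketch below is an assumption-laden illustration: `struct cand_score`, `NUM_KEPT`, `init_candidates()` and `keep_if_better()` are hypothetical names, and a plain integer comparison stands in for `score_compare()`.

```c
#include <assert.h>

#define NUM_KEPT 4	/* hypothetical name; plays the role of NUM_CANDIDATE_PER_DST */

/* Hypothetical, simplified score record: a higher `score` is better. */
struct cand_score {
	int src;	/* index of the source; -1 marks an empty slot */
	int score;
};

/* Mark every slot as empty so that any real candidate beats it. */
static void init_candidates(struct cand_score m[NUM_KEPT])
{
	int i;

	for (i = 0; i < NUM_KEPT; i++) {
		m[i].src = -1;
		m[i].score = -1;
	}
}

/* Replace the weakest kept candidate if `o` beats it. */
static void keep_if_better(struct cand_score m[NUM_KEPT],
			   const struct cand_score *o)
{
	int i, worst = 0;

	assert(o->score >= 0);

	/* find the weakest slot */
	for (i = 1; i < NUM_KEPT; i++)
		if (m[i].score < m[worst].score)
			worst = i;

	/* overwrite it only when the newcomer is strictly better */
	if (o->score > m[worst].score)
		m[worst] = *o;
}
```

Each destination still looks at every source once, but it stores only `NUM_KEPT` records, which is what brings the memory use down from O(N x M) to O(M).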

diffcore-rename: fall back to -C when -C -C busts the rename limit When there are too many paths in the project, the number of rename source candidates "git diff -C -C" finds will exceed the rename detection limit, and no inexact rename detection is performed. We however could fall back to "git diff -C" if the number of modified paths is sufficiently small. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:06 +01:00			`/*`
			`* Returns:`
			`* 0 if we are under the limit;`
			`* 1 if we need to disable inexact rename detection;`
			`* 2 if we would be under the limit if we were given -C instead of -C -C.`
			`*/`
diffcore-rename: refactor "too many candidates" logic Move the logic to a separate function, to be enhanced by later patches in the series. While at it, swap the condition used in the if statement from "if it is too big then do this" to "if it would fit then do this". Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Rebased to 'master' as the logic to use the result of this logic was updated recently, together with the addition of eye-candy. 2011-01-06 22:50:04 +01:00			`static int too_many_rename_candidates(int num_create,`
			`struct diff_options *options)`
			`{`
			`int rename_limit = options->rename_limit;`
			`int num_src = rename_src_nr;`
			`int i;`
			`options->needed_rename_limit = 0;`

			`/*`
			`* This basically does a test for the rename matrix not`
			`* growing larger than a "rename_limit" square matrix, ie:`
			`*`
			`* num_create * num_src > rename_limit * rename_limit`
			`*`
			`* but handles the potential overflow case specially (and we`
			`* assume at least 32-bit integers)`
			`*/`
			`if (rename_limit <= 0 || rename_limit > 32767)`
			`rename_limit = 32767;`
			`if ((num_create <= rename_limit || num_src <= rename_limit) &&`
			`(num_create * num_src <= rename_limit * rename_limit))`
			`return 0;`

			`options->needed_rename_limit =`
			`num_src > num_create ? num_src : num_create;`

			`/* Are we running under -C -C? */`
			`if (!DIFF_OPT_TST(options, FIND_COPIES_HARDER))`
			`return 1;`

			`/* Would we bust the limit if we were running under -C? */`
			`for (num_src = i = 0; i < rename_src_nr; i++) {`
			`if (diff_unmodified_pair(rename_src[i].p))`
			`continue;`
			`num_src++;`
			`}`
			`if ((num_create <= rename_limit || num_src <= rename_limit) &&`
			`(num_create * num_src <= rename_limit * rename_limit))`
			`return 2;`
			`return 1;`
			`}`
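The limit test described in the comment above compares the size of the rename matrix against a `rename_limit`-sized square. The following is a minimal sketch of that comparison, not the function from this file: `SKETCH_MAX_LIMIT` and `sketch_fits_in_square` are hypothetical names, and the sketch swaps in 64-bit widening of the products rather than relying on the per-dimension check the original performs before multiplying.

```c
/* Hypothetical cap: the largest limit whose square still fits in 32 bits. */
#define SKETCH_MAX_LIMIT 32767

/*
 * Return 1 if a rows-by-cols matrix is no larger than a limit-by-limit
 * square, 0 otherwise.  A non-positive or oversized limit is clamped to
 * SKETCH_MAX_LIMIT, mirroring the clamp to 32767 in the code above.
 */
static int sketch_fits_in_square(int rows, int cols, int limit)
{
	if (limit <= 0 || limit > SKETCH_MAX_LIMIT)
		limit = SKETCH_MAX_LIMIT;

	/* Widen before multiplying so neither product can overflow an int. */
	return (long long)rows * cols <= (long long)limit * limit;
}
```

With a limit of 10, a 1-by-100 matrix fits (100 cells) while an 11-by-10 matrix does not (110 cells), so the test is about total work rather than either dimension alone.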

diffcore-rename: properly honor the difference between -M and -C We would allow rename detection to do copy detection even when asked purely for renames. That confuses users, but more importantly it can terminally confuse the recursive merge rename logic. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-19 05:10:32 +01:00			`static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, int copies)`
			`{`
			`int count = 0, i;`

			`for (i = 0; i < dst_cnt * NUM_CANDIDATE_PER_DST; i++) {`
			`struct diff_rename_dst *dst;`

			`if ((mx[i].dst < 0) ||`
			`(mx[i].score < minimum_score))`
			`break; /* there is no more usable pair. */`
			`dst = &rename_dst[mx[i].dst];`
			`if (dst->pair)`
			`continue; /* already done, either exact or fuzzy. */`
diffcore-rename: record filepair for rename src This will allow us to later skip unmodified entries added due to "-C -C". We might also want to do something similar to rename_dst side, but that would only be for the sake of symmetry and not necessary for this series. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:05 +01:00			`if (!copies && rename_src[mx[i].src].p->one->rename_used)`
			`continue;`
			`record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);`
			`count++;`
			`}`
			`return count;`
			`}`

Diff: -l<num> to limit rename/copy detection. When many paths are modified, rename detection takes a lot of time. The new option -l<num> can be used to disable rename detection when more than <num> paths are possibly created as renames. Signed-off-by: Junio C Hamano <junkio@cox.net> 2005-09-21 09:18:27 +02:00			`void diffcore_rename(struct diff_options *options)`
			`{`
			`int detect_rename = options->detect_rename;`
			`int minimum_score = options->rename_score;`
[PATCH] Prepare diffcore interface for diff-tree header supression. This does not actually supress the extra headers when pickaxe is used, but prepares enough support for diff-tree to implement it. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-22 04:40:36 +02:00			`struct diff_queue_struct *q = &diff_queued_diff;`
Plug diff leaks. It is a bit embarrassing that it took this long for a fix since the problem was first reported on Aug 13th. Message-ID: <87y876gl1r.wl@mail2.atmark-techno.com> From: Yasushi SHOJI <yashi@atmark-techno.com> Newsgroups: gmane.comp.version-control.git Subject: [patch] possible memory leak in diff.c::diff_free_filepair() Date: Sat, 13 Aug 2005 19:58:56 +0900 This time I used valgrind to make sure that it does not overeagerly discard memory that is still being used. Signed-off-by: Junio C Hamano <junkio@cox.net> 2005-09-16 01:13:43 +02:00			`struct diff_queue_struct outq;`
			`struct diff_score *mx;`
			`int i, j, rename_count, skip_unmodified = 0;`
diffcore-rename.c: avoid set-but-not-used warning Since 9d8a5a5 (diffcore-rename: refactor "too many candidates" logic, 2011-01-06), diffcore_rename() initializes num_src but does not use it anymore. "-Wunused-but-set-variable" in gcc-4.6 complains about this. Signed-off-by: Jim Meyering <meyering@redhat.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-04-29 11:42:41 +02:00			`int num_create, dst_cnt;`
add inexact rename detection progress infrastructure We might spend many seconds doing inexact rename detection with no output. It's nice to let the user know that something is actually happening. This patch adds the infrastructure, but no callers actually turn on progress reporting. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-20 10:51:16 +01:00			`struct progress *progress = NULL;`

[PATCH] Add the code to set default minimum score back in. When the minimum score is specified as 0 (meaning "use default value"), set it to the default as we are told. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-22 08:33:32 +02:00			`if (!minimum_score)`
[PATCH] Add -B flag to diff-* brothers. A new diffcore transformation, diffcore-break.c, is introduced. When the -B flag is given, a patch that represents a complete rewrite is broken into a deletion followed by a creation. This makes it easier to review such a complete rewrite patch. The -B flag takes the same syntax as the -M and -C flags to specify the minimum amount of non-source material the resulting file needs to have to be considered a complete rewrite, and defaults to 99% if not specified. As the new test t4008-diff-break-rewrite.sh demonstrates, if a file is a complete rewrite, it is broken into a delete/create pair, which can further be subjected to the usual rename detection if -M or -C is used. For example, if file0 gets completely rewritten to make it as if it were rather based on file1 which itself disappeared, the following happens: The original change looks like this: file0 --> file0' (quite different from file0) file1 --> /dev/null After diffcore-break runs, it would become this: file0 --> /dev/null /dev/null --> file0' file1 --> /dev/null Then diffcore-rename matches them up: file1 --> file0' The internal score values are finer grained now. Earlier maximum of 10000 has been raised to 60000; there is no user visible changes but there is no reason to waste available bits. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-30 09:08:37 +02:00			`minimum_score = DEFAULT_RENAME_SCORE;`

			`for (i = 0; i < q->nr; i++) {`
[PATCH] Introducing software archaeologist's tool "pickaxe". This steals the "pickaxe" feature from JIT and make it available to the bare Plumbing layer. From the command line, the user gives a string he is intersted in. Using the diff-core infrastructure previously introduced, it filters the differences to limit the output only to the diffs between <src> and <dst> where the string appears only in one but not in the other. For example: $ ./git-rev-list HEAD | ./git-diff-tree -Sdiff-tree-helper --stdin -M would show the diffs that touch the string "diff-tree-helper". In real software-archaeologist application, you would typically look for a few to several lines of code and see where that code came from. The "pickaxe" module runs after "rename/copy detection" module, so it even crosses the file rename boundary, as the above example demonstrates. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-21 11:40:01 +02:00			`struct diff_filepair *p = q->queue[i];`
git-pickaxe: rename detection optimization The idea is that we are interested in renaming into only one path, so we do not care about renames that happen elsewhere. Signed-off-by: Junio C Hamano <junkio@cox.net> 2006-11-02 09:02:11 +01:00			`if (!DIFF_FILE_VALID(p->one)) {`
[PATCH] The diff-raw format updates. Update the diff-raw format as Linus and I discussed, except that it does not use sequence of underscore '_' letters to express nonexistence. All '0' mode is used for that purpose instead. The new diff-raw format can express rename/copy, and the earlier restriction that -M and -C _must_ be used with the patch format output is no longer necessary. The patch makes -M and -C flags independent of -p flag, so you need to say git-whatchanged -M -p to get the diff/patch format. Updated are both documentations and tests. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-22 04:42:18 +02:00			`if (!DIFF_FILE_VALID(p->two))`
[PATCH] Be careful with symlinks when detecting renames and copies. Earlier round was not treating symbolic links carefully enough, and would have produced diff output that renamed/copied then edited the contents of a symbolic link, which made no practical sense. Change it to detect only pure renames. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-23 06:24:49 +02:00			`continue; /* unmerged */`
			`else if (options->single_follow &&`
			`strcmp(options->single_follow, p->two->path))`
			`continue; /* not interested */`
teach diffcore-rename to optionally ignore empty content Our rename detection is a heuristic, matching pairs of removed and added files with similar or identical content. It's unlikely to be wrong when there is actual content to compare, and we already take care not to do inexact rename detection when there is not enough content to produce good results. However, we always do exact rename detection, even when the blob is tiny or empty. It's easy to get false positives with an empty blob, simply because it is an obvious content to use as a boilerplate (e.g., when telling git that an empty directory is worth tracking via an empty .gitignore). This patch lets callers specify whether or not they are interested in using empty files as rename sources and destinations. The default is "yes", keeping the original behavior. It works by detecting the empty-blob sha1 for rename sources and destinations. One more flexible alternative would be to allow the caller to specify a minimum size for a blob to be "interesting" for rename detection. But that would catch small boilerplate files, not large ones (e.g., if you had the GPL COPYING file in many directories). A better alternative would be to allow a "-rename" gitattribute to allow boilerplate files to be marked as such. I'll leave the complexity of that solution until such time as somebody actually wants it. The complaints we've seen so far revolve around empty files, so let's start with the simple thing. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2012-03-22 23:52:13 +01:00			`else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&`
			`is_empty_blob_sha1(p->two->sha1))`
			`continue;`
			`else`
[PATCH] Redo rename/copy detection logic. Earlier implementation had a major screw-up in the memory management area. Rename/copy logic sometimes borrowed a pointer to a structure without any provision for downstream to determine which pointer is shared and which is not. This resulted in the later clean-up code to sometimes double free such structure, resulting in a segfault. This made -M and -C useless. Another problem the earlier implementation had was that it reordered the patches, and forced the logic to differentiate renames and copies to depend on that particular order. This problem was fixed by teaching rename/copy detection logic not to do any reordering, and rename-copy differentiator not to depend on the order of the patches. The diffs will leave rename/copy detector in the same destination path order as the patch that was fed into it. Some test vectors have been reordered to accommodate this change. It also adds a sanity check logic to the human-readable diff-raw output to detect paths with embedded TAB and LF characters, which cannot be expressed with that format. This idea came up during a discussion with Chris Wedgwood. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 10:10:48 +02:00			`locate_rename_dst(p->two, 1);`
			`}`
			`else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&`
			`is_empty_blob_sha1(p->one->sha1))`
			`continue;`
diffcore-rename: don't consider unmerged path as source Since e9c8409 (diff-index --cached --raw: show tree entry on the LHS for unmerged entries., 2007-01-05), an unmerged entry should be detected by using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of the filepair records mode=0 entries. However, it forgot to update some parts of the rename detection logic. This only makes difference in the "diff --cached" codepath where an unmerged filepair carries information on the entries that came from the tree. It probably hasn't been noticed for a long time because nobody would run "diff -M" during a conflict resolution, but "git status" uses rename detection when it internally runs "diff-index" and "diff-files" and gives nonsense results. In an unmerged pair, "one" side can have a valid filespec to record the tree entry (e.g. what's in HEAD) when running "diff --cached". This can be used as a rename source to other paths in the index that are not unmerged. The path that is unmerged by definition does not have the final content yet (i.e. "two" side cannot have a valid filespec), so it can never be a rename destination. Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and allow the valid "one" side of an unmerged filepair to be considered a potential rename source, but never to be considered a rename destination. Commit message and first two test cases by Junio, the rest by Martin. Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-03-24 03:41:01 +01:00			`else if (!DIFF_PAIR_UNMERGED(p) && !DIFF_FILE_VALID(p->two)) {`
copy vs rename detection: avoid unnecessary O(n*m) loops The core rename detection had some rather stupid code to check if a pathname was used by a later modification or rename, which basically walked the whole pathname space for all renames for each rename, in order to tell whether it was a pure rename (no remaining users) or should be considered a copy (other users of the source file remaining). That's really silly, since we can just keep a count of users around, and replace all those complex and expensive loops with just testing that simple counter (but this all depends on the previous commit that shared the diff_filespec data structure by using a separate reference count). Note that the reference count is not the same as the rename count: they behave otherwise rather similarly, but the reference count is tied to the allocation (and decremented at de-allocation, so that when it turns zero we can get rid of the memory), while the rename count is tied to the renames and is decremented when we find a rename (so that when it turns zero we know that it was a rename, not a copy). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:20:56 +02:00			`/*`
			`* If the source is a broken "delete", and`
[PATCH] Fix rename/copy when dealing with temporarily broken pairs. When rename/copy uses a file that was broken by diffcore-break as the source, and the broken filepair gets merged back later, the output was mislabeled as a rename. In this case, the source file ends up staying in the output, so we should label it as a copy instead. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-06-12 05:55:20 +02:00			`* they did not really want to get broken,`
			`* that means the source actually stays.`
			`* So we increment the "rename_used" score`
			`* by one, to indicate ourselves as a user`
			`*/`
			`if (p->broken_pair && !p->score)`
			`p->one->rename_used++;`
			`register_rename_src(p);`
			`}`
			`else if (detect_rename == DIFF_DETECT_COPY) {`
			`/*`
			`* Increment the "rename_used" score by`
			`* one, to indicate ourselves as a user.`
[PATCH] Fix rename/copy when dealing with temporarily broken pairs. When rename/copy uses a file that was broken by diffcore-break as the source, and the broken filepair gets merged back later, the output was mislabeled as a rename. In this case, the source file ends up staying in the output, so we should label it as a copy instead. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-06-12 05:55:20 +02:00			`*/`
copy vs rename detection: avoid unnecessary O(n*m) loops The core rename detection had some rather stupid code to check if a pathname was used by a later modification or rename, which basically walked the whole pathname space for all renames for each rename, in order to tell whether it was a pure rename (no remaining users) or should be considered a copy (other users of the source file remaining). That's really silly, since we can just keep a count of users around, and replace all those complex and expensive loops with just testing that simple counter (but this all depends on the previous commit that shared the diff_filespec data structure by using a separate reference count). Note that the reference count is not the same as the rename count: they behave otherwise rather similarly, but the reference count is tied to the allocation (and decremented at de-allocation, so that when it turns zero we can get rid of the memory), while the rename count is tied to the renames and is decremented when we find a rename (so that when it turns zero we know that it was a rename, not a copy). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:20:56 +02:00			`p->one->rename_used++;`
diffcore-rename: record filepair for rename src This will allow us to later skip unmodified entries added due to "-C -C". We might also want to do something similar to rename_dst side, but that would only be for the sake of symmetry and not necessary for this series. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:05 +01:00			`register_rename_src(p);`
[PATCH] Fix rename/copy when dealing with temporarily broken pairs. 2005-06-12 05:55:20 +02:00			`}`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`}`
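The rename-versus-copy bookkeeping described in the annotation above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this file: `struct source`, `mark_source_kept()`, `add_destination()`, and `resolve_destination()` are hypothetical names, and only the `rename_used` counter itself appears in the listing.

```c
/* Hypothetical stand-in for a rename/copy source; not git's diff_filespec. */
struct source {
	const char *path;
	int rename_used;   /* number of users still referring to this source */
};

enum match_kind { MATCH_RENAME, MATCH_COPY };

/* A source that stays in the result counts itself as one user. */
void mark_source_kept(struct source *s)
{
	s->rename_used++;
}

/* Every destination matched to this source is one more user. */
void add_destination(struct source *s)
{
	s->rename_used++;
}

/*
 * Resolve one matched destination. The counter is decremented first;
 * if other users remain, the source lives on and this match is a copy.
 * Only the last remaining user is reported as a rename.
 */
enum match_kind resolve_destination(struct source *s)
{
	s->rename_used--;
	return s->rename_used > 0 ? MATCH_COPY : MATCH_RENAME;
}
```

With a deleted source matched to two destinations, the first resolution yields a copy and the second a rename. A source that is kept never reaches zero through its destinations, so all of its matches are copies. No pass over the other pairs is needed, which is the point of keeping a counter.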
Fix the rename detection limit checking This adds more proper rename detection limits. Instead of just checking the limit against the number of potential rename destinations, we verify that the rename matrix (which is what really matters) doesn't grow ridiculously large, and we also make sure that we don't overflow when doing the matrix size calculation. This also changes the default limits from unlimited, to a rename matrix that is limited to 100 entries on a side. You can raise it with the config entry, or by using the "-l<n>" command line flag, but at least the default is now a sane number that avoids spending lots of time (and memory) in situations that likely don't merit it. The choice of default value is of course very debatable. Limiting the rename matrix to a 100x100 size will mean that even if you have just one obvious rename, but you also create (or delete) 10,000 files, the rename matrix will be so big that we disable the heuristics. Sounds reasonable to me, but let's see if people hit this (and, perhaps more importantly, actually care) in real life. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-09-14 19:39:48 +02:00			`if (rename_dst_nr == 0 || rename_src_nr == 0)`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`goto cleanup; /* nothing to do */`

Do exact rename detection regardless of rename limits Now that the exact rename detection is linear-time (with a very small constant factor to boot), there is no longer any reason to limit it by the number of files involved. In some trivial testing, I created a repository with a directory that had a hundred thousand files in it (all with different contents), and then moved that directory to show the effects of renaming 100,000 files. With the new code, that resulted in [torvalds@woody big-rename]$ time ~/git/git show -C | wc -l 400006 real 0m2.071s user 0m1.520s sys 0m0.576s ie the code can correctly detect the hundred thousand renames in about 2 seconds (the number "400006" comes from four lines for each rename: diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1 similarity index 100% rename from really-big-dir/file-1-1-1-1-1 rename to moved-big-dir/file-1-1-1-1-1 and the extra six lines is from a one-liner commit message and all the commit information and spacing). Most of those two seconds weren't even really the rename detection, it's really all the other stuff needed to get there. With the old code, this wouldn't have been practically possible. Doing a pairwise check of the ten billion possible pairs would have been prohibitively expensive. In fact, even with the rename limiter in place, the old code would waste a lot of time just on the diff_filespec checks, and despite not even trying to find renames, it used to look like: [torvalds@woody big-rename]$ time git show -C | wc -l 1400006 real 0m12.337s user 0m12.285s sys 0m0.192s ie we used to take 12 seconds for this load and not even do any rename detection! (The number 1400006 comes from fourteen lines per file moved: seven lines each for the delete and the create of a one-liner file, and the same extra six lines of commit information). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:24:47 +02:00			`/*`
			`* We really want to cull the candidates list early`
			`* with cheap tests in order to avoid doing deltas.`
			`*/`
for_each_hash: allow passing a 'void *data' pointer to callback For the find_exact_renames() function, this allows us to pass the diff_options structure pointer to the low-level routines. We will use that to distinguish between the "rename" and "copy" cases. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-19 04:55:19 +01:00			`rename_count = find_exact_renames(options);`
Do exact rename detection regardless of rename limits 2007-10-25 20:24:47 +02:00
Do the fuzzy rename detection limits with the exact renames removed When we do the fuzzy rename detection, we don't care about the destinations that we already handled with the exact rename detector. And, in fact, the code already knew that - but the rename limiter, which used to run before exact renames were detected, did not. This fixes it so that the rename detection limiter now bases its decisions on the remaining rename counts, rather than the original ones. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-27 01:56:34 +02:00			`/* Did we only want exact renames? */`
			`if (minimum_score == MAX_SCORE)`
			`goto cleanup;`

			`/*`
			`* Calculate how many renames are left (but all the source`
			`* files still remain as options for rename/copies!)`
			`*/`
			`num_create = (rename_dst_nr - rename_count);`

			`/* All done? */`
			`if (!num_create)`
			`goto cleanup;`

diffcore-rename: fall back to -C when -C -C busts the rename limit When there are too many paths in the project, the number of rename source candidates "git diff -C -C" finds will exceed the rename detection limit, and no inexact rename detection is performed. We however could fall back to "git diff -C" if the number of modified paths is sufficiently small. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-01-06 22:50:06 +01:00			`switch (too_many_rename_candidates(num_create, options)) {`
			`case 1:`
Fix the rename detection limit checking 2007-09-14 19:39:48 +02:00			`goto cleanup;`
diffcore-rename: fall back to -C when -C -C busts the rename limit 2011-01-06 22:50:06 +01:00			`case 2:`
			`options->degraded_cc_to_c = 1;`
			`skip_unmodified = 1;`
			`break;`
			`default:`
			`break;`
			`}`
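The limit check that feeds the switch above is described in the annotations as a bound on the size of the rename matrix, computed without overflow. A rough sketch of that idea follows. It is not git's `too_many_rename_candidates()`: the function name `rename_matrix_too_large()`, its parameters, and the treatment of a non-positive limit as "no limit" are all assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Return 1 if a num_src x num_dst comparison matrix holds more cells
 * than a limit x limit matrix, 0 otherwise.
 */
int rename_matrix_too_large(int num_src, int num_dst, int limit)
{
	if (num_src <= 0 || num_dst <= 0)
		return 0;   /* nothing to compare */
	if (limit <= 0)
		return 0;   /* assumption of this sketch: no limit configured */

	/* Widen to 64 bits before multiplying so the products cannot overflow int. */
	return (uint64_t)num_src * (uint64_t)num_dst >
	       (uint64_t)limit * (uint64_t)limit;
}
```

Comparing the products rather than each side separately lets a lopsided matrix (few sources, many destinations) pass as long as the total work stays bounded.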
Fix the rename detection limit checking 2007-09-14 19:39:48 +02:00
add inexact rename detection progress infrastructure We might spend many seconds doing inexact rename detection with no output. It's nice to let the user know that something is actually happening. This patch adds the infrastructure, but no callers actually turn on progress reporting. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-20 10:51:16 +01:00			`if (options->show_rename_progress) {`
			`progress = start_progress_delay(`
			`"Performing inexact rename detection",`
			`rename_dst_nr * rename_src_nr, 50, 1);`
			`}`

Optimize rename detection for a huge diff When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" that record how similar each pair is, and picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one, it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and continue on to match itself with another file in the preimage with a lesser match. This matters to a change introduced between 1.5.3.X series and 1.5.4-rc, that lets the code favor unused matches first and then fall back to using already used matches. This instead allocates and keeps only a handful of rename source candidates per new file in the postimage. I.e. it reduces the memory requirement from O(N x M) to O(M). For each dst, we compute similarity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep a handful of the best src candidates for each dst. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2008-01-30 05:54:56 +01:00			`mx = xcalloc(num_create * NUM_CANDIDATE_PER_DST, sizeof(*mx));`
[PATCH] Redo rename/copy detection logic. Earlier implementation had a major screw-up in the memory management area. Rename/copy logic sometimes borrowed a pointer to a structure without any provision for downstream to determine which pointer is shared and which is not. This resulted in the later clean-up code to sometimes double free such structure, resulting in a segfault. This made -M and -C useless. Another problem the earlier implementation had was that it reordered the patches, and forced the logic to differentiate renames and copies to depend on that particular order. This problem was fixed by teaching rename/copy detection logic not to do any reordering, and rename-copy differentiator not to depend on the order of the patches. The diffs will leave rename/copy detector in the same destination path order as the patch that was fed into it. Some test vectors have been reordered to accommodate this change. It also adds a sanity check logic to the human-readable diff-raw output to detect paths with embedded TAB and LF characters, which cannot be expressed with that format. This idea came up during a discussion with Chris Wedgwood. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 10:10:48 +02:00			`for (dst_cnt = i = 0; i < rename_dst_nr; i++) {`
			`struct diff_filespec *two = rename_dst[i].two;`
Optimize rename detection for a huge diff 2008-01-30 05:54:56 +01:00			`struct diff_score *m;`

[PATCH] Redo rename/copy detection logic. 2005-05-24 10:10:48 +02:00			`if (rename_dst[i].pair)`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`continue; /* dealt with exact match already. */`
Optimize rename detection for a huge diff 2008-01-30 05:54:56 +01:00
			`m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];`
			`for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)`
			`m[j].dst = -1;`

[PATCH] Redo rename/copy detection logic. 2005-05-24 10:10:48 +02:00			`for (j = 0; j < rename_src_nr; j++) {`
diffcore-rename: record filepair for rename src 2011-01-06 22:50:05 +01:00			`struct diff_filespec *one = rename_src[j].p->one;`
Optimize rename detection for a huge diff 2008-01-30 05:54:56 +01:00			`struct diff_score this_src;`
diffcore-rename: fall back to -C when -C -C busts the rename limit 2011-01-06 22:50:06 +01:00
			`if (skip_unmodified &&`
			`diff_unmodified_pair(rename_src[j].p))`
			`continue;`

Optimize rename detection for a huge diff 2008-01-30 05:54:56 +01:00			`this_src.score = estimate_similarity(one, two,`
			`minimum_score);`
			`this_src.name_score = basename_same(one, two);`
			`this_src.dst = i;`
			`this_src.src = j;`
			`record_if_better(m, &this_src);`
diffcore-rename: reduce memory footprint by freeing blob data early After running one round of estimate_similarity(), filespecs on either side will have populated their cnt_data fields, and we do not need the blob text anymore. We used to retain the blob data to optimize for smaller projects (not freeing the blob data here would mean that the final output phase would not have to re-read it), but we are efficient enough without such optimization for smaller projects anyway, and freeing memory early will help larger projects. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2009-11-21 07:13:47 +01:00			`/*`
			`* Once we run estimate_similarity,`
			`* we do not need the text anymore.`
			`*/`
rename diff_free_filespec_data_large() to diff_free_filespec_blob() Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-03 06:01:03 +02:00			`diff_free_filespec_blob(one);`
diffcore-rename: reduce memory footprint by freeing blob data early 2009-11-21 07:13:47 +01:00			`diff_free_filespec_blob(two);`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`}`
			`dst_cnt++;`
add inexact rename detection progress infrastructure 2011-02-20 10:51:16 +01:00			`display_progress(progress, (i+1)*rename_src_nr);`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`}`
add inexact rename detection progress infrastructure 2011-02-20 10:51:16 +01:00			`stop_progress(&progress);`
Optimize rename detection for a huge diff When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" that record how similar each of the pair is, and picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage that are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one, it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and continue on to match itself with another file in the preimage with a lesser match. This matters to a change introduced between 1.5.3.X series and 1.5.4-rc, that lets the code to favor unused matches first and then falls back to using already used matches. This instead allocates and keeps only a handful rename source candidates per new files in the postimage. I.e. it makes the memory requirement from O(N x M) to O(M). For each dst, we compute similarity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep handful best src candidates for each dst. Signed-off-by: Junio C Hamano <gitster@pobox.com> 2008-01-30 05:54:56 +01:00
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`/* cost matrix sorted by most to least similar pair */`
Optimize rename detection for a huge diff 2008-01-30 05:54:56 +01:00			`qsort(mx, dst_cnt * NUM_CANDIDATE_PER_DST, sizeof(*mx), score_compare);`

diffcore-rename: properly honor the difference between -M and -C We would allow rename detection to do copy detection even when asked purely for renames. That confuses users, but more importantly it can terminally confuse the recursive merge rename logic. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-02-19 05:10:32 +01:00			`rename_count += find_renames(mx, dst_cnt, minimum_score, 0);`
			`if (detect_rename == DIFF_DETECT_COPY)`
			`rename_count += find_renames(mx, dst_cnt, minimum_score, 1);`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`free(mx);`

[PATCH] Fix the way diffcore-rename records unremoved source. Earlier version of diffcore-rename used to keep unmodified filepair in its output so that the last stage of the processing that tells renames from copies can make all of rename/copy to copies. However this had a bad interaction with other diffcore filters that wanted to run after diffcore-rename, in that such unmodified filepair must be retained for proper distinction between renames and copies to happen. This patch fixes the problem by changing the way diffcore-rename records the information needed to distinguish "all are copies" case and "the last one is a rename" case. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-28 00:55:55 +02:00			`cleanup:`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`/* At this point, we have found some renames and copies and they`
Plug diff leaks. It is a bit embarrassing that it took this long for a fix since the problem was first reported on Aug 13th. Message-ID: <87y876gl1r.wl@mail2.atmark-techno.com> From: Yasushi SHOJI <yashi@atmark-techno.com> Newsgroups: gmane.comp.version-control.git Subject: [patch] possible memory leak in diff.c::diff_free_filepair() Date: Sat, 13 Aug 2005 19:58:56 +0900 This time I used valgrind to make sure that it does not overeagerly discard memory that is still being used. Signed-off-by: Junio C Hamano <junkio@cox.net> 2005-09-16 01:13:43 +02:00			`* are recorded in rename_dst. The original list is still in *q.`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`*/`
Add a macro DIFF_QUEUE_CLEAR. Refactor the diff_queue_struct code, this macro help to reset the structure. Signed-off-by: Bo Yang <struggleyb.nku@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2010-05-07 06:52:27 +02:00			`DIFF_QUEUE_CLEAR(&outq);`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`for (i = 0; i < q->nr; i++) {`
[PATCH] Redo rename/copy detection logic. Earlier implementation had a major screw-up in the memory management area. Rename/copy logic sometimes borrowed a pointer to a structure without any provision for downstream to determine which pointer is shared and which is not. This resulted in the later clean-up code to sometimes double free such structure, resulting in a segfault. This made -M and -C useless. Another problem the earlier implementation had was that it reordered the patches, and forced the logic to differentiate renames and copies to depend on that particular order. This problem was fixed by teaching rename/copy detection logic not to do any reordering, and rename-copy differentiator not to depend on the order of the patches. The diffs will leave rename/copy detector in the same destination path order as the patch that was fed into it. Some test vectors have been reordered to accommodate this change. It also adds a sanity check logic to the human-readable diff-raw output to detect paths with embedded TAB and LF characters, which cannot be expressed with that format. This idea came up during a discussion with Chris Wedgwood. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-24 10:10:48 +02:00			`struct diff_filepair *p = q->queue[i];`
			`struct diff_filepair *pair_to_free = NULL;`

diffcore-rename: don't consider unmerged path as source Since e9c8409 (diff-index --cached --raw: show tree entry on the LHS for unmerged entries., 2007-01-05), an unmerged entry should be detected by using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of the filepair records mode=0 entries. However, it forgot to update some parts of the rename detection logic. This only makes difference in the "diff --cached" codepath where an unmerged filepair carries information on the entries that came from the tree. It probably hasn't been noticed for a long time because nobody would run "diff -M" during a conflict resolution, but "git status" uses rename detection when it internally runs "diff-index" and "diff-files" and gives nonsense results. In an unmerged pair, "one" side can have a valid filespec to record the tree entry (e.g. what's in HEAD) when running "diff --cached". This can be used as a rename source to other paths in the index that are not unmerged. The path that is unmerged by definition does not have the final content yet (i.e. "two" side cannot have a valid filespec), so it can never be a rename destination. Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and allow the valid "one" side of an unmerged filepair to be considered a potential rename source, but never to be considered a rename destination. Commit message and first two test cases by Junio, the rest by Martin. Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2011-03-24 03:41:01 +01:00			`if (DIFF_PAIR_UNMERGED(p)) {`
			`diff_q(&outq, p);`
			`}`
			`else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {`
[PATCH] diff: fix the culling of unneeded delete record. The commit 15d061b435a7e3b6bead39df3889f4af78c4b00a [PATCH] Fix the way diffcore-rename records unremoved source. still leaves unneeded delete records in its output stream by mistake, which was covered up by having an extra check to turn such a delete into a no-op downstream. Fix the check in the diffcore-rename to simplify the output routine. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-30 09:08:07 +02:00			`/*`
			`* Creation`
			`*`
			`* We would output this create record if it has`
			`* not been turned into a rename/copy already.`
			`*/`
			`struct diff_rename_dst *dst =`
			`locate_rename_dst(p->two, 0);`
			`if (dst && dst->pair) {`
[PATCH] Redo rename/copy detection logic. 2005-05-24 10:10:48 +02:00			`diff_q(&outq, dst->pair);`
			`pair_to_free = p;`
			`}`
			`else`
[PATCH] diff: fix the culling of unneeded delete record. 2005-05-30 09:08:07 +02:00			`/* no matching rename/copy source, so`
			`* record this as a creation.`
[PATCH] Redo rename/copy detection logic. 2005-05-24 10:10:48 +02:00			`*/`
			`diff_q(&outq, p);`
[PATCH] Diff overhaul, adding half of copy detection. 2005-05-21 11:39:09 +02:00			`}`
[PATCH] diff: fix the culling of unneeded delete record. 2005-05-30 09:08:07 +02:00			`else if (DIFF_FILE_VALID(p->one) && !DIFF_FILE_VALID(p->two)) {`
			`/*`
			`* Deletion`
			`*`
[PATCH] Add -B flag to diff-* brothers. A new diffcore transformation, diffcore-break.c, is introduced. When the -B flag is given, a patch that represents a complete rewrite is broken into a deletion followed by a creation. This makes it easier to review such a complete rewrite patch. The -B flag takes the same syntax as the -M and -C flags to specify the minimum amount of non-source material the resulting file needs to have to be considered a complete rewrite, and defaults to 99% if not specified. As the new test t4008-diff-break-rewrite.sh demonstrates, if a file is a complete rewrite, it is broken into a delete/create pair, which can further be subjected to the usual rename detection if -M or -C is used. For example, if file0 gets completely rewritten to make it as if it were rather based on file1 which itself disappeared, the following happens: The original change looks like this: file0 --> file0' (quite different from file0) file1 --> /dev/null After diffcore-break runs, it would become this: file0 --> /dev/null /dev/null --> file0' file1 --> /dev/null Then diffcore-rename matches them up: file1 --> file0' The internal score values are finer grained now. Earlier maximum of 10000 has been raised to 60000; there is no user visible changes but there is no reason to waste available bits. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-30 09:08:37 +02:00			`* We would output this delete record if:`
			`*`
			`* (1) this is a broken delete and the counterpart`
			`* broken create remains in the output; or`
Plug diff leaks. 2005-09-16 01:13:43 +02:00			`* (2) this is not a broken delete, and rename_dst`
			`* does not have a rename/copy to move p->one->path`
			`* out of existence.`
[PATCH] Add -B flag to diff-* brothers. 2005-05-30 09:08:37 +02:00			`*`
			`* Otherwise, the counterpart broken create`
			`* has been turned into a rename-edit; or`
			`* delete did not have a matching create to`
			`* begin with.`
[PATCH] diff: fix the culling of unneeded delete record. 2005-05-30 09:08:07 +02:00			`*/`
[PATCH] Add -B flag to diff-* brothers. 2005-05-30 09:08:37 +02:00			`if (DIFF_PAIR_BROKEN(p)) {`
			`/* broken delete */`
			`struct diff_rename_dst *dst =`
			`locate_rename_dst(p->one, 0);`
			`if (dst && dst->pair)`
			`/* counterpart is now rename/copy */`
			`pair_to_free = p;`
			`}`
			`else {`
copy vs rename detection: avoid unnecessary O(n*m) loops The core rename detection had some rather stupid code to check if a pathname was used by a later modification or rename, which basically walked the whole pathname space for all renames for each rename, in order to tell whether it was a pure rename (no remaining users) or should be considered a copy (other users of the source file remaining). That's really silly, since we can just keep a count of users around, and replace all those complex and expensive loops with just testing that simple counter (but this all depends on the previous commit that shared the diff_filespec data structure by using a separate reference count). Note that the reference count is not the same as the rename count: they behave otherwise rather similarly, but the reference count is tied to the allocation (and decremented at de-allocation, so that when it turns zero we can get rid of the memory), while the rename count is tied to the renames and is decremented when we find a rename (so that when it turns zero we know that it was a rename, not a copy). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:20:56 +02:00			`if (p->one->rename_used)`
[PATCH] Add -B flag to diff-* brothers. 2005-05-30 09:08:37 +02:00			`/* this path remains */`
			`pair_to_free = p;`
			`}`
[PATCH] diff: fix the culling of unneeded delete record. 2005-05-30 09:08:07 +02:00
			`if (pair_to_free)`
			`;`
			`else`
			`diff_q(&outq, p);`
			`}`
[PATCH] Redo rename/copy detection logic. 2005-05-24 10:10:48 +02:00			`else if (!diff_unmodified_pair(p))`
[PATCH] Fix the way diffcore-rename records unremoved source. Earlier version of diffcore-rename used to keep unmodified filepair in its output so that the last stage of the processing that tells renames from copies can make all of rename/copy to copies. However this had a bad interaction with other diffcore filters that wanted to run after diffcore-rename, in that such unmodified filepair must be retained for proper distinction between renames and copies to happen. This patch fixes the problem by changing the way diffcore-rename records the information needed to distinguish "all are copies" case and "the last one is a rename" case. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-28 00:55:55 +02:00			`/* all the usual ones need to be kept */`
			`diff_q(&outq, p);`
			`else`
			`/* no need to keep unmodified pairs */`
			`pair_to_free = p;`

[PATCH] Introduce diff_free_filepair() function. This introduces a new function to free a common data structure, and plugs some leaks. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org> 2005-05-28 00:50:30 +02:00			`if (pair_to_free)`
			`diff_free_filepair(pair_to_free);`
			`}`
			`diff_debug_queue("done copying original", &outq);`

			`free(q->queue);`
			`*q = outq;`
			`diff_debug_queue("done collapsing", q);`
Ref-count the filespecs used by diffcore Rather than copy the filespecs when introducing new versions of them (for rename or copy detection), use a refcount and increment the count when reusing the diff_filespec. This avoids unnecessary allocations, but the real reason behind this is a future enhancement: we will want to track shared data across the copy/rename detection. In order to efficiently notice when a filespec is used by a rename, the rename machinery wants to keep track of a rename usage count which is shared across all different users of the filespec. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2007-10-25 20:19:10 +02:00			`for (i = 0; i < rename_dst_nr; i++)`
			`free_filespec(rename_dst[i].two);`
Plug diff leaks. It is a bit embarrassing that it took this long for a fix since the problem was first reported on Aug 13th. Message-ID: <87y876gl1r.wl@mail2.atmark-techno.com> From: Yasushi SHOJI <yashi@atmark-techno.com> Newsgroups: gmane.comp.version-control.git Subject: [patch] possible memory leak in diff.c::diff_free_filepair() Date: Sat, 13 Aug 2005 19:58:56 +0900 This time I used valgrind to make sure that it does not overeagerly discard memory that is still being used. Signed-off-by: Junio C Hamano <junkio@cox.net> 2005-09-16 01:13:43 +02:00
			`free(rename_dst);`
			`rename_dst = NULL;`
			`rename_dst_nr = rename_dst_alloc = 0;`
			`free(rename_src);`
			`rename_src = NULL;`
			`rename_src_nr = rename_src_alloc = 0;`
			`return;`
			`}`
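
The commit note attached to the `free_filespec(rename_dst[i].two)` loop describes replacing copies of filespecs with a reference count that is incremented on reuse. A minimal sketch of that ownership pattern follows. The names `demo_spec`, `demo_spec_new`, `demo_spec_get`, and `demo_spec_put` are hypothetical and are not git's actual API; the sketch only illustrates the counting rule, under the assumption that every holder releases its reference exactly once.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a shared file description; not git's struct. */
struct demo_spec {
	int count;   /* number of holders that still reference this spec */
	char *path;  /* heap copy of the path, owned by the spec */
};

/* Allocate a spec with one holder. Returns NULL if allocation fails. */
struct demo_spec *demo_spec_new(const char *path)
{
	size_t len = strlen(path) + 1;
	struct demo_spec *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->path = malloc(len);
	if (!s->path) {
		free(s);
		return NULL;
	}
	memcpy(s->path, path, len);
	s->count = 1;
	return s;
}

/* Share the spec with another holder instead of copying it. */
struct demo_spec *demo_spec_get(struct demo_spec *s)
{
	s->count++;
	return s;
}

/*
 * Drop one reference. The memory is released only when the last holder
 * lets go. Returns 1 if the spec was freed, 0 if other holders remain.
 */
int demo_spec_put(struct demo_spec *s)
{
	assert(s->count > 0);
	if (--s->count)
		return 0;
	free(s->path);
	free(s);
	return 1;
}
```

With this rule a cleanup loop can call the release function on every entry without tracking which pointers are shared, which is the kind of double free the earlier commit notes describe.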