1
0
Fork 0
mirror of https://github.com/git/git.git synced 2024-11-15 13:43:45 +01:00
Find a file
Kirill Smelkov 72441af7c4 tree-diff: rework diff_tree() to generate diffs for multiparent cases as well
Previously diff_tree(), which is now named ll_diff_tree_sha1(), was
generating diff_filepair(s) for two trees t1 and t2, and that was
usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes
a commit introduces.

In Git, however, we have fundamentally built flexibility in that a
commit can have many parents - 1 for a plain commit, 2 for a simple merge,
but also more than 2 for merging several heads at once.

For merges there is a so called combine-diff, which shows diff, a merge
introduces by itself, omitting changes done by any parent. That works
through first finding paths, that are different to all parents, and then
showing generalized diff, with separate columns for +/- for each parent.
The code lives in combine-diff.c .

There is an impedance mismatch, however, in that a commit could
generally have any number of parents, and that while diffing trees, we
divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there
is no special casing for multiple parents commits in e.g.
revision-walker .

That impedance mismatch *hurts* *performance* *badly* for generating
combined diffs - in "combine-diff: optimize combine_diff_path
sets intersection" I've already removed some slowness from it, but from
the timings provided there, it could be seen, that combined diffs still
cost more than an order of magnitude more cpu time, compared to diff for
usual commits, and that would only be an optimistic estimate, if we take
into account that for e.g. linux.git there is only one merge for several
dozens of plain commits.

That slowness comes from the fact that currently, while generating
combined diff, a lot of time is spent computing diff(commit,commit^2)
just to only then intersect that huge diff to almost small set of files
from diff(commit,commit^1).

That's because at present, to compute combine-diff, for first finding
paths, that "every parent touches", we use the following combine-diff
property/definition:

D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)      (w.r.t. paths)

where

D(A,P1...Pn) is combined diff between commit A, and parents Pi

and

D(A,Pi) is usual two-tree diff Pi..A

So if any of that D(A,Pi) is huge, tracting 1 n-parent combine-diff as n
1-parent diffs and intersecting results will be slow.

And usually, for linux.git and other topic-based workflows, that
D(A,P2) is huge, because, if merge-base of A and P2, is several dozens
of merges (from A, via first parent) below, that D(A,P2) will be diffing
sum of merges from several subsystems to 1 subsystem.

The solution is to avoid computing n 1-parent diffs, and to find
changed-to-all-parents paths via scanning A's and all Pi's trees
simultaneously, at each step comparing their entries, and based on that
comparison, populate paths result, and deduce we could *skip*
*recursing* into subdirectories, if at least for 1 parent, sha1 of that
dir tree is the same as in A. That would save us from doing significant
amount of needless work.

Such approach is very similar to what diff_tree() does, only there we
deal with scanning only 2 trees simultaneously, and for n+1 tree, the
logic is a bit more complex:

D(T,P1...Pn) calculation scheme
-------------------------------

D(T,P1...Pn) = D(T,P1) ^ ... ^ D(T,Pn)	(regarding resulting paths set)

    D(T,Pj)		- diff between T..Pj
    D(T,P1...Pn)	- combined diff from T to parents P1,...,Pn

We start from all trees, which are sorted, and compare their entries in
lock-step:

     T     P1       Pn
     -     -        -
    |t|   |p1|     |pn|
    |-|   |--| ... |--|      imin = argmin(p1...pn)
    | |   |  |     |  |
    |-|   |--|     |--|
    |.|   |. |     |. |
     .     .        .
     .     .        .

at any time there could be 3 cases:

    1)  t < p[imin];
    2)  t > p[imin];
    3)  t = p[imin].

Schematic deduction of what every case means, and what to do, follows:

1)  t < p[imin]  ->  ∀j t ∉ Pj  ->  "+t" ∈ D(T,Pj)  ->  D += "+t";  t↓

2)  t > p[imin]

    2.1) ∃j: pj > p[imin]  ->  "-p[imin]" ∉ D(T,Pj)  ->  D += ø;  ∀ pi=p[imin]  pi↓
    2.2) ∀i  pi = p[imin]  ->  pi ∉ T  ->  "-pi" ∈ D(T,Pi)  ->  D += "-p[imin]";  ∀i pi↓

3)  t = p[imin]

    3.1) ∃j: pj > p[imin]  ->  "+t" ∈ D(T,Pj)  ->  only pi=p[imin] remains to investigate
    3.2) pi = p[imin]  ->  investigate δ(t,pi)
     |
     |
     v

    3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø  ->

                      ⎧δ(t,pi)  - if pi=p[imin]
             ->  D += ⎨
                      ⎩"+t"     - if pi>p[imin]

    in any case t↓  ∀ pi=p[imin]  pi↓

~

For comparison, here is how diff_tree() works:

D(A,B) calculation scheme
-------------------------

    A     B
    -     -
   |a|   |b|    a < b   ->  a ∉ B   ->   D(A,B) +=  +a    a↓
   |-|   |-|    a > b   ->  b ∉ A   ->   D(A,B) +=  -b    b↓
   | |   | |    a = b   ->  investigate δ(a,b)            a↓ b↓
   |-|   |-|
   |.|   |.|
    .     .
    .     .

~~~~~~~~

This patch generalizes diff tree-walker to work with arbitrary number of
parents as described above - i.e. now there is a resulting tree t, and
some parents trees tp[i] i=[0..nparent). The generalization builds on
the fact that usual diff

D(A,B)

is by definition the same as combined diff

D(A,[B]),

so if we could rework the code for common case and make it be not slower
for nparent=1 case, usual diff(t1,t2) generation will not be slower, and
multiparent diff tree-walker would greatly benefit generating
combine-diff.

What we do is as follows:

1) diff tree-walker ll_diff_tree_sha1() is internally reworked to be
   a paths generator (new name diff_tree_paths()), with each generated path
   being `struct combine_diff_path` with info for path, new sha1,mode and for
   every parent which sha1,mode it was in it.

2) From that info, we can still generate usual diff queue with
   struct diff_filepairs, via "exporting" generated
   combine_diff_path, if we know we run for nparent=1 case.
   (see emit_diff() which is now named emit_diff_first_parent_only())

3) In order for diff_can_quit_early(), which checks

       DIFF_OPT_TST(opt, HAS_CHANGES))

   to work, that exporting have to be happening not in bulk, but
   incrementally, one diff path at a time.

   For such consumers, there is a new callback in diff_options
   introduced:

       ->pathchange(opt, struct combine_diff_path *)

   which, if set to !NULL, is called for every generated path.

   (see new compat ll_diff_tree_sha1() wrapper around new paths
    generator for setup)

4) The paths generation itself, is reworked from previous
   ll_diff_tree_sha1() code according to "D(A,P1...Pn) calculation
   scheme" provided above:

   On the start we allocate [nparent] arrays in place what was
   earlier just for one parent tree.

   then we just generalize loops, and comparison according to the
   algorithm.

Some notes(*):

1) alloca(), for small arrays, is used for "runs not slower for
   nparent=1 case than before" goal - if we change it to xmalloc()/free()
   the timings get ~1% worse. For alloca() we use just-introduced
   xalloca/xalloca_free compatibility wrappers, so it should not be a
   portability problem.

2) For every parent tree, we need to keep a tag, whether entry from that
   parent equals to entry from minimal parent. For performance reasons I'm
   keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ.
   Not doing so, we'd need to alloca another [nparent] array, which hurts
   performance.

3) For emitted paths, memory could be reused, if we know the path was
   processed via callback and will not be needed later. We use efficient
   hand-made realloc-style path_appendnew(), that saves us from ~1-1.5%
   of potential additional slowdown.

4) goto(s) are used in several places, as the code executes a little bit
   faster with lowered register pressure.

Also

- we should now check for FIND_COPIES_HARDER not only when two entries
  names are the same, and their hashes are equal, but also for a case,
  when a path was removed from some of all parents having it.

  The reason is, if we don't, that path won't be emitted at all (see
  "a > xi" case), and we'll just skip it, and FIND_COPIES_HARDER wants
  all paths - with diff or without - to be emitted, to be later analyzed
  for being copies sources.

  The new check is only necessary for nparent >1, as for nparent=1 case
  xmin_eqtotal always =1 =nparent, and a path is always added to diff as
  removal.

~~~~~~~~

Timings for

    # without -c, i.e. testing only nparent=1 case
    `git log --raw --no-abbrev --no-renames`

before and after the patch are as follows:

                navy.git        linux.git v3.10..v3.11

    before      0.611s          1.889s
    after       0.619s          1.907s
    slowdown    1.3%            0.9%

This timings show we did no harm to usual diff(tree1,tree2) generation.
From the table we can see that we actually did ~1% slowdown, but I think
I've "earned" that 1% in the previous patch ("tree-diff: reuse base
str(buf) memory on sub-tree recursion", HEAD~~) so for nparent=1 case,
net timings stays approximately the same.

The output also stayed the same.

(*) If we revert 1)-4) to more usual techniques, for nparent=1 case,
    we'll get ~2-2.5% of additional slowdown, which I've tried to avoid, as
   "do no harm for nparent=1 case" rule.

For linux.git, combined diff will run an order of magnitude faster and
appropriate timings will be provided in the next commit, as we'll be
taking advantage of the new diff tree-walker for combined-diff
generation there.

P.S. and combined diff is not some exotic/for-play-only stuff - for
example for a program I write to represent Git archives as readonly
filesystem, there is initial scan with

    `git log --reverse --raw --no-abbrev --no-renames -c`

to extract log of what was created/changed when, as a result building a
map

    {}  sha1    ->  in which commit (and date) a content was added

that `-c` means also show combined diff for merges, and without them, if
a merge is non-trivial (merges changes from two parents with both having
separate changes to a file), or an evil one, the map will not be full,
i.e. some valid sha1 would be absent from it.

That case was my initial motivation for combined diffs speedup.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-04-07 14:40:46 -07:00
block-sha1
builtin Merge branch 'sb/repack-in-c' 2014-01-27 10:45:49 -08:00
compat Merge branch 'ef/mingw-write' 2014-01-27 10:44:59 -08:00
contrib Merge branch 'jk/complete-merge-base' 2014-01-27 10:43:55 -08:00
Documentation Git 1.9-rc1 2014-01-27 11:01:35 -08:00
git-gui git-gui 0.19.0 2014-01-21 13:16:17 -08:00
gitk-git Merge git://ozlabs.org/~paulus/gitk 2014-01-23 08:50:50 -08:00
gitweb
mergetools
perl git-svn: memoize _rev_list and rebuild 2014-01-23 02:54:26 +00:00
po l10n: Bulgarian translation of git (222t21f1967u) 2014-01-29 14:29:15 +02:00
ppc
t tests: add checking that combine-diff emits only correct paths 2014-02-24 14:44:57 -08:00
templates
vcs-svn
xdiff
.gitattributes
.gitignore
.mailmap
abspath.c
aclocal.m4
advice.c
advice.h
alias.c
alloc.c
archive-tar.c
archive-zip.c
archive.c
archive.h
argv-array.c
argv-array.h
attr.c
attr.h
base85.c
bisect.c
bisect.h
blob.c
blob.h
branch.c
branch.h
builtin.h
bulk-checkin.c
bulk-checkin.h
bundle.c
bundle.h
cache-tree.c
cache-tree.h
cache.h tree-diff: rework diff_tree() to generate diffs for multiparent cases as well 2014-04-07 14:40:46 -07:00
check-builtins.sh
check-racy.c
check_bindir
color.c
color.h
column.c
column.h
combine-diff.c combine-diff: move changed-paths scanning logic into its own function 2014-02-24 14:46:11 -08:00
command-list.txt
commit-slab.h
commit.c
commit.h Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
config.c
config.mak.in
config.mak.uname Portable alloca for Git 2014-03-27 11:54:01 -07:00
configure.ac Portable alloca for Git 2014-03-27 11:54:01 -07:00
connect.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
connect.h
connected.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
connected.h
convert.c
convert.h
copy.c
COPYING
credential-cache--daemon.c
credential-cache.c
credential-store.c
credential.c
credential.h
csum-file.c
csum-file.h
ctype.c
daemon.c
date.c
decorate.c
decorate.h
delta.h
diff-delta.c
diff-lib.c combine-diff: combine_diff_path.len is not needed anymore 2014-02-24 14:44:57 -08:00
diff-no-index.c
diff.c tree-diff: rework diff_tree() to generate diffs for multiparent cases as well 2014-04-07 14:40:46 -07:00
diff.h tree-diff: rework diff_tree() to generate diffs for multiparent cases as well 2014-04-07 14:40:46 -07:00
diffcore-break.c
diffcore-delta.c
diffcore-order.c diffcore-order: export generic ordering interface 2014-02-24 14:44:57 -08:00
diffcore-pickaxe.c
diffcore-rename.c
diffcore.h diffcore-order: export generic ordering interface 2014-02-24 14:44:57 -08:00
dir.c Merge branch 'mh/safe-create-leading-directories' 2014-01-27 10:45:33 -08:00
dir.h
editor.c
entry.c
environment.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
exec_cmd.c
exec_cmd.h
fast-import.c
fetch-pack.c Merge branch 'jk/allow-fetch-onelevel-refname' 2014-01-27 10:44:14 -08:00
fetch-pack.h Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
fmt-merge-msg.h
fsck.c
fsck.h
generate-cmdlist.sh
gettext.c
gettext.h
git-add--interactive.perl
git-am.sh
git-archimport.perl
git-bisect.sh
git-compat-util.h Portable alloca for Git 2014-03-27 11:54:01 -07:00
git-cvsexportcommit.perl
git-cvsimport.perl
git-cvsserver.perl
git-difftool--helper.sh
git-difftool.perl
git-filter-branch.sh
git-instaweb.sh
git-merge-octopus.sh
git-merge-one-file.sh
git-merge-resolve.sh
git-mergetool--lib.sh
git-mergetool.sh
git-p4.py git p4: fix an error message when "p4 where" fails 2014-01-22 08:06:19 -08:00
git-parse-remote.sh
git-pull.sh Merge branch 'jk/pull-rebase-using-fork-point' 2014-01-17 12:04:29 -08:00
git-quiltimport.sh
git-rebase--am.sh
git-rebase--interactive.sh
git-rebase--merge.sh
git-rebase.sh
git-relink.perl
git-remote-testgit.sh
git-request-pull.sh
git-send-email.perl Merge branch 'rk/send-email-ssl-cert' 2014-01-27 10:44:34 -08:00
git-sh-i18n.sh
git-sh-setup.sh
git-stash.sh
git-submodule.sh Merge branch 'fp/submodule-checkout-mode' 2014-01-17 12:21:20 -08:00
git-svn.perl
GIT-VERSION-GEN Git 1.9-rc2 2014-01-31 14:16:06 -08:00
git-web--browse.sh
git.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
git.rc Makefile: Fix compilation of Windows resource file 2014-01-23 10:00:28 -08:00
git.spec.in
gpg-interface.c
gpg-interface.h
graph.c
graph.h
grep.c
grep.h
hash.c
hash.h
help.c
help.h
hex.c
http-backend.c
http-fetch.c
http-push.c
http-walker.c
http.c
http.h
ident.c
imap-send.c
INSTALL
kwset.c
kwset.h
levenshtein.c
levenshtein.h
LGPL-2.1
line-log.c line-log: convert to using diff_tree_sha1() 2014-02-05 10:50:36 -08:00
line-log.h
line-range.c
line-range.h
list-objects.c Merge branch 'jk/mark-edges-uninteresting' 2014-01-27 10:45:08 -08:00
list-objects.h
ll-merge.c
ll-merge.h
lockfile.c
log-tree.c
log-tree.h
mailmap.c
mailmap.h
Makefile Portable alloca for Git 2014-03-27 11:54:01 -07:00
match-trees.c
merge-blobs.c
merge-blobs.h
merge-recursive.c Merge branch 'mh/safe-create-leading-directories' 2014-01-27 10:45:33 -08:00
merge-recursive.h
merge.c
mergesort.c
mergesort.h
name-hash.c
notes-cache.c
notes-cache.h
notes-merge.c
notes-merge.h
notes-utils.c
notes-utils.h
notes.c
notes.h
object.c
object.h
pack-check.c
pack-revindex.c
pack-revindex.h
pack-write.c
pack.h
pager.c
parse-options-cb.c
parse-options.c
parse-options.h
patch-delta.c
patch-ids.c
patch-ids.h
path.c
pathspec.c
pathspec.h
pkt-line.c
pkt-line.h
preload-index.c
pretty.c
prio-queue.c
prio-queue.h
progress.c
progress.h
prompt.c
prompt.h
quote.c
quote.h
reachable.c
reachable.h
read-cache.c
README
reflog-walk.c
reflog-walk.h
refs.c Merge branch 'mh/safe-create-leading-directories' 2014-01-27 10:45:33 -08:00
refs.h
RelNotes
remote-curl.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
remote-testsvn.c
remote.c Merge branch 'mh/retire-ref-fetch-rules' 2014-01-27 10:44:07 -08:00
remote.h Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
replace_object.c
rerere.c
rerere.h
resolve-undo.c
resolve-undo.h
revision.c revision: convert to using diff_tree_sha1() 2014-02-05 10:51:16 -08:00
revision.h
run-command.c
run-command.h
send-pack.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
send-pack.h
sequencer.c
sequencer.h
server-info.c
setup.c
sh-i18n--envsubst.c
sha1-array.c
sha1-array.h
sha1-lookup.c
sha1-lookup.h
sha1_file.c Merge branch 'ss/safe-create-leading-dir-with-slash' 2014-01-27 10:45:37 -08:00
sha1_name.c Merge branch 'jk/interpret-branch-name-fix' 2014-01-27 10:44:21 -08:00
shallow.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
shell.c
shortlog.h
show-index.c
sideband.c
sideband.h
sigchain.c
sigchain.h
strbuf.c
strbuf.h
streaming.c Merge branch 'ef/mingw-write' 2014-01-27 10:44:59 -08:00
streaming.h
string-list.c
string-list.h
submodule.c
submodule.h
symlinks.c
tag.c
tag.h
tar.h
test-chmtime.c
test-ctype.c
test-date.c
test-delta.c
test-dump-cache-tree.c
test-genrandom.c
test-index-version.c
test-line-buffer.c
test-match-trees.c
test-mergesort.c
test-mktemp.c
test-parse-options.c
test-path-utils.c
test-prio-queue.c
test-read-cache.c
test-regex.c
test-revision-walking.c
test-run-command.c
test-scrap-cache-tree.c
test-sha1.c
test-sha1.sh
test-sigchain.c
test-string-list.c
test-subprocess.c
test-svn-fe.c
test-urlmatch-normalization.c
test-wildmatch.c
thread-utils.c
thread-utils.h
trace.c
transport-helper.c Merge branch 'ef/mingw-write' 2014-01-27 10:44:59 -08:00
transport.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
transport.h
tree-diff.c tree-diff: rework diff_tree() to generate diffs for multiparent cases as well 2014-04-07 14:40:46 -07:00
tree-walk.c tree-walk: finally switch over tree descriptors to contain a pre-parsed entry 2014-02-24 14:43:29 -08:00
tree-walk.h tree-walk: finally switch over tree descriptors to contain a pre-parsed entry 2014-02-24 14:43:29 -08:00
tree.c
tree.h
unimplemented.sh
unix-socket.c
unix-socket.h
unpack-trees.c
unpack-trees.h
upload-pack.c Merge branch 'nd/shallow-clone' 2014-01-17 12:21:20 -08:00
url.c
url.h
urlmatch.c
urlmatch.h
usage.c
userdiff.c
userdiff.h
utf8.c
utf8.h
varint.c
varint.h
version.c
version.h
walker.c
walker.h
wildmatch.c
wildmatch.h
wrap-for-bin.sh
wrapper.c
write_or_die.c
ws.c
wt-status.c
wt-status.h
xdiff-interface.c
xdiff-interface.h
zlib.c

////////////////////////////////////////////////////////////////

	Git - the stupid content tracker

////////////////////////////////////////////////////////////////

"git" can mean anything, depending on your mood.

 - random three-letter combination that is pronounceable, and not
   actually used by any common UNIX command.  The fact that it is a
   mispronunciation of "get" may or may not be relevant.
 - stupid. contemptible and despicable. simple. Take your pick from the
   dictionary of slang.
 - "global information tracker": you're in a good mood, and it actually
   works for you. Angels sing, and a light suddenly fills the room.
 - "goddamn idiotic truckload of sh*t": when it breaks

Git is a fast, scalable, distributed revision control system with an
unusually rich command set that provides both high-level operations
and full access to internals.

Git is an Open Source project covered by the GNU General Public
License version 2 (some parts of it are under different licenses,
compatible with the GPLv2). It was originally written by Linus
Torvalds with help of a group of hackers around the net.

Please read the file INSTALL for installation instructions.

See Documentation/gittutorial.txt to get started, then see
Documentation/everyday.txt for a useful minimum set of commands, and
Documentation/git-commandname.txt for documentation of each command.
If git has been correctly installed, then the tutorial can also be
read with "man gittutorial" or "git help tutorial", and the
documentation of each command with "man git-commandname" or "git help
commandname".

CVS users may also want to read Documentation/gitcvs-migration.txt
("man gitcvs-migration" or "git help cvs-migration" if git is
installed).

Many Git online resources are accessible from http://git-scm.com/
including full documentation and Git related tools.

The user discussion and development of Git take place on the Git
mailing list -- everyone is welcome to post bug reports, feature
requests, comments and patches to git@vger.kernel.org (read
Documentation/SubmittingPatches for instructions on patch submission).
To subscribe to the list, send an email with just "subscribe git" in
the body to majordomo@vger.kernel.org. The mailing list archives are
available at http://news.gmane.org/gmane.comp.version-control.git/,
http://marc.info/?l=git and other archival sites.

The maintainer frequently sends the "What's cooking" reports that
list the current status of various development topics to the mailing
list.  The discussion following them give a good reference for
project status, development direction and remaining tasks.