This is the first cut at a git-rev-list that knows to ignore commits that
don't change a certain file (or set of files).
NOTE! For now it only prunes _merge_ commits, and follows the parent where
there are no differences in the set of files specified. In the long run,
I'd like to make it re-write the straight-line history too, but for now
the merge simplification is much more fundamentally important (the
rewriting of straight-line history is largely a separate simplification
phase, but the merge simplification needs to happen early if we want to
optimize away unnecessary commit parsing).
If all parents of a merge change some of the files, the merge is left as
is, so the end result is in no way guaranteed to be a linear history, but
it will often be a lot /more/ linear than the full tree, since it prunes
out parents that didn't matter for that set of files.
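The merge rule described above can be sketched roughly as follows. This is a toy model and not the rev-list.c code: `simplify_parents` and the `same_for_paths` predicate are hypothetical names, and the predicate stands in for "this parent has no differences in the set of files specified".

```python
def simplify_parents(parents, same_for_paths):
    """Return the parents worth following for one commit.

    parents: parent ids in their recorded order.
    same_for_paths: hypothetical predicate, True when that parent's
    version of the chosen files is identical to the commit's.
    """
    if len(parents) > 1:
        for parent in parents:
            if same_for_paths(parent):
                # This parent fully explains the files: prune the others.
                return [parent]
    # Not a merge, or every parent changes some file: leave it as is.
    return list(parents)
```

A merge whose parents all differ keeps every parent, which is why the result is not guaranteed to be linear.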
As an example from the current kernel:
[torvalds@g5 linux]$ git-rev-list HEAD | wc -l
9885
[torvalds@g5 linux]$ git-rev-list HEAD -- Makefile | wc -l
4084
[torvalds@g5 linux]$ git-rev-list HEAD -- drivers/usb | wc -l
5206
and you can also use 'gitk' to more visually see the pruning of the
history tree, with something like
gitk -- drivers/usb
showing a simplified history that tries to follow the first parent in a
merge that is the parent that fully defines drivers/usb/.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
I took a look at webgit, and it looks like at least for the "projects"
page, the most common operation ends up being basically
git-rev-list --header --parents --max-count=1 HEAD
Now, the thing is, the way "git-rev-list" works, it always keeps on
popping the parents and parsing them in order to build the list of
parents, and it turns out that even though we just want a single commit,
git-rev-list will invariably look up _three_ generations of commits.
It will parse:
- the commit we want (it obviously needs this)
- its parent(s) as part of the "pop_most_recent_commit()" logic
- it will then pop one of the parents before it notices that it doesn't
need any more
- and as part of popping the parent, it will parse the grandparent (again
due to "pop_most_recent_commit()").
Now, I've strace'd it, and it really is pretty efficient on the whole, but
if things aren't nicely cached, and with long-latency IO, doing those two
extra objects (at a minimum - if the parent is a merge it will be more) is
just wasted time, and potentially a lot of it.
So here's a quick special-case for the trivial case of "just one commit,
and no date-limits or other special rules".
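To make the "three generations" point concrete, here is a small simulation. It is only a sketch under simplifying assumptions: it ignores date ordering, `list_commits` and its `special_case` switch are hypothetical, and the `parsed` list merely records which commits the loop would have had to read.

```python
def list_commits(head, parents_of, max_count, special_case=False):
    parsed = [head]  # the commit we asked for always gets parsed
    if special_case and max_count == 1:
        # Shortcut: one commit wanted, nothing else needs reading.
        return [head], parsed
    queue, out = [head], []
    while queue:
        commit = queue.pop(0)
        # Popping a commit parses its parents so they can be queued.
        for parent in parents_of.get(commit, []):
            if parent not in parsed:
                parsed.append(parent)
                queue.append(parent)
        if len(out) == max_count:
            break  # only now do we notice nothing more is needed
        out.append(commit)
    return out, parsed
```

On a straight chain HEAD, P, G the general loop reads all three commits to print one, while the shortcut reads only HEAD.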
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Cloning from a repository with more than 256 refs (heads and tags
included) will choke, because upload-pack has a built-in limit of
feeding not more than MAX_NEEDS (currently 256) heads to underlying
git-rev-list. This is a problem when cloning a repository with many
tags, like http://www.linux-mips.org/pub/scm/linux.git, which has 290+
tags.
This commit introduces a new flag, --all, to git-rev-list, to include
all refs in the repository. The updated upload-pack detects requests that
ask for more than MAX_NEEDS refs, and sends everything back instead.
We probably want to tweak the definitions of MAX_NEEDS and
MAX_HAS, but that is a separate topic.
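The fallback amounts to something like the sketch below. The function name `rev_list_ref_args` is made up for illustration, and only the part of the argument list that names refs is modelled.

```python
MAX_NEEDS = 256  # limit quoted in the text above

def rev_list_ref_args(needs):
    """Pick the ref arguments to hand to git-rev-list (illustrative only)."""
    if len(needs) > MAX_NEEDS:
        # Too many heads to pass one by one: ask for every ref instead.
        return ["--all"]
    return list(needs)
```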
Signed-off-by: Junio C Hamano <junkio@cox.net>
A carefully crafted pathname can be used to disrupt downstream git-pack-objects
that uses 'git-rev-list --objects' output. Prevent this.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The trick is to consider the time-based filtering a limiter, the same way
we do for release ranges.
That means that the time-based filtering runs _before_ the topological
sorting, which makes it meaningful again. It also simplifies the code
logic.
This makes "gitk" useful with time ranges.
[ Second version: --merge-order now unaffected by the re-org ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
There were two bugs in there:
- if the range didn't end up working, we restored the '.' character in
the wrong place.
- an empty end-of-range should be interpreted as HEAD.
See rev-parse.c for the reference implementation of this.
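As a rough illustration of the second fix, a range argument could be split like this. It is a sketch in Python rather than the C code in rev-parse.c, and `parse_range` is a hypothetical helper.

```python
def parse_range(arg):
    """Split 'start..end' into (start, end); return None if arg is not a range."""
    start, sep, end = arg.partition("..")
    if not sep:
        # No ".." found: the argument is left untouched for the caller.
        return None
    # An empty end-of-range means HEAD.
    return start, end or "HEAD"
```

Because the argument is never modified in place, there is no '.' character to restore when the range does not work out.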
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The object parsing code builds a generic "this object references that
object" because doing a full connectivity check for fsck requires it.
However, nothing else really needs it, and it's quite expensive for
git-rev-list that can have tons of objects in flight.
So, exactly like the commit buffer save thing, add a global flag to
disable it, and use it in git-rev-list.
Before:
$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
12.28user 0.29system 0:12.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+26718minor)pagefaults 0swaps
59124
After this change:
$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
10.33user 0.18system 0:10.54elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+18509minor)pagefaults 0swaps
59124
and note how the number of pages touched by git-rev-list for this
particular object list has shrunk from 26,718 (104 MB) to 18,509 (72 MB).
Calculating the total object difference between two git revisions is still
clearly the most expensive git operation (both in memory and CPU time),
but it's now less than 40% of what it used to be.
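The figures can be checked with a few lines of arithmetic. The 4096-byte page size is an assumption (it is what makes 26,718 pages come out at 104 MB), and reading "what it used to be" as the 47,947 faults of the earliest run in this series is my interpretation.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def pages_to_mib(pages):
    """Convert a minor-fault page count into mebibytes."""
    return pages * PAGE_SIZE / (1024 * 1024)

before = 26718    # minor faults before this change
after = 18509     # minor faults after this change
original = 47947  # minor faults in the earliest run of the series

print(int(pages_to_mib(before)), int(pages_to_mib(after)))
print(after / original)
```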
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This avoids keeping tree entries around, and frees them as it traverses
the list. This avoids building up a huge memory footprint just for these
small but very common allocations.
Before:
$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
11.65user 0.38system 0:12.65elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+42934minor)pagefaults 0swaps
59124
After:
$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
12.28user 0.29system 0:12.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+26718minor)pagefaults 0swaps
59124
Note how the minor fault numbers - which end up being how many pages we
needed to map - go down from 42934 (167 MB) to 26718 (104 MB). That is:
Before:
42934 minor pagefaults
After:
26718 minor pagefaults
This is all in _addition_ to the previous fixes. It used to be
~48,000 pagefaults.
That's still a honking big memory footprint, but it's about half of what
it was just a day or two ago (and this is the object list for a pretty big
update - almost 60,000 objects. Smaller updates need less memory).
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The logic to calculate the full object list used to be very inter-twined
with the logic that looked up the commits.
For no good reason - it's actually a lot simpler to just do that logic
as a separate pass.
This improves performance a bit, and uses slightly less memory in my
tests, but more importantly it makes the code simpler to work with and
follow what it does.
The performance win is less than I had hoped for, but I get:
Before:
[torvalds@g5 linux]$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
13.64user 0.42system 0:14.13elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+47947minor)pagefaults 0swaps
58945
After:
[torvalds@g5 linux]$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
11.80user 0.36system 0:12.16elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+42684minor)pagefaults 0swaps
58945
ie it improved by 2 seconds, and took 5000+ fewer pages (hey, that's
20MB out of 174MB to go). And got the same number of objects (in theory,
the more expensive one might find some more shared objects to avoid. In
practice it obviously doesn't).
I know how to make it use _lots_ less memory, which will probably speed it
up. But that's for another time, and I'd prefer to see this go in first.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
As pointed out on the list, git-rev-list can use a lot of memory.
One low-hanging fruit is to free the commit buffer for commits that we
parse. By default, parse_commit() will save away the buffer, since a lot
of cases do want it, and re-reading it continually would be unnecessary.
However, in many cases the buffer isn't actually necessary and saving it
just wastes memory.
We could just free the buffer ourselves, but especially in git-rev-list,
we actually end up using the helper functions that automatically add
parent commits to the commit lists, so we don't actually control the
commit parsing directly.
Instead, just make this behaviour of "parse_commit()" a global flag.
Maybe this is a bit tasteless, but it's very simple, and it makes a
noticeable difference in memory usage.
Before the change:
[torvalds@g5 linux]$ /usr/bin/time git-rev-list v2.6.12..HEAD > /dev/null
0.26user 0.02system 0:00.28elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3714minor)pagefaults 0swaps
after the change:
[torvalds@g5 linux]$ /usr/bin/time git-rev-list v2.6.12..HEAD > /dev/null
0.26user 0.00system 0:00.27elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2433minor)pagefaults 0swaps
note how the minor faults have decreased from 3714 pages to 2433 pages.
That's all due to the fewer anonymous pages allocated to hold the commit
buffers and their metadata.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Recent changes in git have broken cg-log. git-rev-list no longer
prints "commit" in front of commit hashes. It turns out a local
"prefix" variable in main() shadows a file-scoped "prefix" variable.
The patch removed the local "prefix" variable since its value is never
used (in the intended way, that is). The call to
setup_git_directory() is kept since it has useful side effects.
The file-scoped "prefix" variable is renamed to "commit_prefix" just
in case someone reintroduces "prefix" to hold the return value of
setup_git_directory().
Signed-off-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This trivial patch makes "git-rev-list" able to handle not being in
the top-level directory. This magically also makes "git-whatchanged"
do the right thing.
Trivial scripting fix to make sure that "git log" also works.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
When following tags, check for parse_object() success and error out
properly instead of segfaulting.
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This introduces --pretty=oneline to git-rev-tree and
git-rev-list commands to show only the first line of the commit
message, without frills.
Signed-off-by: Junio C Hamano <junkio@cox.net>
As requested by Junio (who suggested --single-parents-only, but this
could forget a no-parent root).
Also, adds a few missing options to the usage string.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The King Penguin says:
Now, for extra bonus points, maybe you should make "git-rev-list" also
understand the "rev..rev" format (which you can't do with just the
get_sha1() interface, since it expands into more).
The faithful servant makes it so.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Support for completely OpenSSL-less builds. FSF considers distributing GPL
binaries with OpenSSL linked in as a legal problem so this is trouble
e.g. for Debian, or some people might not want to install OpenSSL
anyway. If you
make NO_OPENSSL=1
you get a completely OpenSSL-less build, disabling --merge-order and using
Mozilla's SHA1 implementation.
Ported from Cogito.
Signed-off-by: Petr Baudis <pasky@ucw.cz>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This corner-case was triggered by a kernel commit that was not in date
order, due to a misconfigured time zone that made the commit appear three
hours older than it was.
That caused git-rev-list to traverse the commit tree in a non-obvious
order, and made it parse several of the _parents_ of the misplaced commit
before it actually parsed the commit itself. That's fine, but it meant
that the grandparents of the commit didn't get marked uninteresting,
because they had been reached through an "interesting" branch.
The reason was that "mark_parents_uninteresting()" (which is supposed to
mark all existing parents as being uninteresting - duh) didn't actually
traverse more than one level down the parent chain.
NORMALLY this is fine, since with the date-based traversal order,
grandparents won't ever even have been looked at before their parents (so
traversing the chain down isn't needed, because the next time around when
we pick out the parent we'll mark _its_ parents uninteresting), but since
we'd gotten out of order, we'd already seen the parent and thus never got
around to mark the grandparents.
Anyway, the fix is simple. Just traverse parent chains recursively.
Normally the chain won't even exist (since the parent hasn't been parsed
yet), so this is not actually going to trigger except in this strange
corner-case.
Add a comment to the simple one-liner, since this was a bit subtle, and I
had to really think things through to understand how it could happen.
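The fix can be pictured with the toy model below. It is not the one-liner from the patch: `parsed_parents` is a hypothetical dict that only contains commits that have already been parsed, and `uninteresting` is a plain set standing in for the flag bit.

```python
def mark_parents_uninteresting(commit, parsed_parents, uninteresting):
    """Mark every already-known ancestor of commit as uninteresting."""
    for parent in parsed_parents.get(commit, []):
        uninteresting.add(parent)
        # The fix: keep going down the chain. For a parent that has not
        # been parsed yet there is no entry, so the recursion stops here.
        mark_parents_uninteresting(parent, parsed_parents, uninteresting)
```

In the normal date-ordered case the parent has no known parents yet, so this does the same work as the old single-level version.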
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
We'll mark all the trees at the edges (as deep as we had to go to
realize that we have all the commits needed) as uninteresting.
Otherwise we'll occasionally list a lot of objects that were actually
available at the edge in a commit that we just never ended up parsing
because we could determine early that we had all relevant commits.
NOTE! The object listing is still just a _heuristic_. It's guaranteed
to list a superset of the actual new objects, but there might be the
occasional old object in the list, just because the commit that
referenced it was much further back in the history.
For example, let's say that a recent commit is a revert of part of the
tree to much older state: since we didn't walk _that_ far back in the
commit history tree to list the commits necessary, git-rev-tree will
never have marked the old objects uninteresting, and we'll end up
listing them as "new".
That's ok.
When we allow a tag object in place of a commit object, we only
dereferenced the given tag once, which causes a tag that points at a tag
that points at a commit to be rejected. Instead, dereference tag
repeatedly until we get a non-tag.
This patch changes two functions:
- commit.c::lookup_commit_reference() is used by merge-base,
rev-tree and rev-parse to convert user supplied SHA1 to that of
a commit.
- rev-list uses its own get_commit_reference() to do the same.
Dereferencing tags this way helps both of these uses.
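Repeated dereferencing looks roughly like this. The `objects` table is a made-up stand-in for the object database, mapping an id to a (type, target) pair, and `peel_to_non_tag` is a hypothetical name.

```python
def peel_to_non_tag(sha, objects):
    """Follow tag objects until something that is not a tag is reached."""
    obj_type, target = objects[sha]
    while obj_type == "tag":
        # A tag may point at another tag, so loop instead of peeling once.
        sha = target
        obj_type, target = objects[sha]
    return sha, obj_type
```

The caller can then check that the resulting type is a commit and reject anything else.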
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This change ensures that git-rev-list --merge-order produces the same result
irrespective of where the --merge-order argument appears in the argument
list.
Signed-off-by: Jon Seymour <jon.seymour@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
That's what we should have done in the first place, since it not only
avoids another unnecessary flag, it also protects the commits from
showing up as duplicates later when they show up as parents of another
commit (in the pop_most_recent_commit() path).
This will hopefully also fix --topo-sort.
This patch implements a small tidy up of rev-list.c to reduce
(but not eliminate) the amount of ugliness associated
with the merge_order flag.
Signed-off-by: Jon Seymour <jon.seymour@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Not only is it unnecessary, it incorrectly allows extraneous characters
at the end of the argument.
Junio noticed the --merge-order thing, and Jon points out that if we fix
that one, we should fix --show-breaks too.
Now you can give git-rev-list tags, trees and blobs, and it will do the
proper reachability for them all. Knock wood.
Of course, you need the "--objects" flag to do anything but plain
commits.
We want to be able to just say "give a difference between these
objects", rather than limiting it to commits only. This isn't there
yet, but it sets things up to be a bit easier.
When you do
git-rev-list --objects $(git-rev-parse HEAD^..HEAD)
it now lists not only the "commit difference" between the parent of HEAD
and HEAD itself (which is normally just HEAD, but in the case of a
merge will be all the newly merged commits), but also all the new tree
and blob objects that weren't in the original.
NOTE! It doesn't walk all the way to the root, so it doesn't do a full
object search in the full old history. Instead, it will only look as
far back in the history as it needs to resolve the commits. Thus, if
the commit reverts a blob (or tree) back to a state much further back in
history, we may end up listing some blobs (or trees) as "new" even
though they exist further back.
Regardless, the list of objects will be a superset (usually exact) list
of objects needed to go from the beginning commit to ending commit.
As a particularly obvious special case,
git-rev-list --objects HEAD
will end up listing every single object that is reachable from the HEAD
commit.
Side note: the objects are sorted by "recency", with commits first.
This patch fixes a problem reported by Paul Mackerras regarding the interaction
of the --merge-order and --max-age switches of git-rev-list.
This patch applies to the current Linus HEAD. A cleaner fix for the same problem
in my current HEAD will follow later.
With this change, --merge-order produces the same result as no --merge-order
on the linux-2.6 git repository, to wit:
$> git-rev-list --max-age=1116330140 bcfff0b471a60df350338bcd727fc9b8a6aa54b2 | wc -l
655
$> git-rev-list --merge-order --max-age=1116330140 bcfff0b471a60df350338bcd727fc9b8a6aa54b2 | wc -l
655
Signed-off-by: Jon Seymour <jon.seymour@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If b is reachable from a, then:
git-rev-list a b
would print one of the commits twice.
This patch fixes that problem. A previous patch fixed it for the
--merge-order switch.
Signed-off-by: Jon Seymour <jon.seymour@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is useful for doing binary searching for problems. You start with
a known good and known bad point, and you then test the "halfway" point
in between:
git-rev-list --bisect bad ^good
and you test that. If that one tests good, you now still have a known
bad case, but two known good points, and you can bisect again:
git-rev-list --bisect bad ^good1 ^good2
and test that point. If that point is bad, you now use that as your
known-bad starting point:
git-rev-list --bisect newbad ^good1 ^good2
and basically at every iteration you shrink your list of commits by
half: you're binary searching for the point where the troubles started,
even though there isn't a nice linear ordering.
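One way to picture the "halfway" choice on a non-linear history is the sketch below. It is an illustration of the idea and not necessarily how --bisect is implemented: `reachable` and `bisect_point` are hypothetical helpers, and `parents_of` is a plain dict of parent lists.

```python
def reachable(tips, parents_of):
    """Return every commit reachable from the given tips."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents_of.get(commit, []))
    return seen

def bisect_point(bad, goods, parents_of):
    """Pick the candidate that splits the suspect commits most evenly."""
    candidates = reachable([bad], parents_of) - reachable(goods, parents_of)
    total = len(candidates)
    best, best_score = None, -1
    for commit in sorted(candidates):
        # Suspects that would be cleared if this commit tests good.
        below = len(reachable([commit], parents_of) & candidates)
        score = min(below, total - below)
        if score > best_score:
            best, best_score = commit, score
    return best
```

Each test result then either adds a new ^good argument or replaces the bad one, and the candidate set shrinks by about half.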
This patch tidies up the git-rev-list documentation and epoch.c, which
are in severe clash with the unwritten coding style now, and quite
unreadable.
It also fixes up compile failures with older compilers due to variable
declarations after code.
The patch mostly wraps lines before or on the 80th column, removes
plenty of superfluous empty lines and changes comments from // to /* */.
Signed-off-by: Petr Baudis <pasky@ucw.cz>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch linearises the GIT commit history graph into merge order
which is defined by invariants specified in Documentation/git-rev-list.txt.
The linearisation produced by this patch is superior in an objective sense
to that produced by the existing git-rev-list implementation in that
the linearisation produced is guaranteed to have the minimum number of
discontinuities, where a discontinuity is defined as an adjacent pair of
commits in the output list which are not related in a direct child-parent
relationship.
With this patch a graph like this:
a4 ---
| \   \
|  b4 |
|/ |  |
a3 |  |
|  |  |
a2 |  |
|  |  c3
|  |  |
|  |  c2
|  b3 |
|  | /|
|  b2 |
|  |  c1
|  | /
|  b1
a1 |
|  |
a0 |
| /
root
Sorts like this:
= a4
| c3
| c2
| c1
^ b4
| b3
| b2
| b1
^ a3
| a2
| a1
| a0
= root
Instead of this:
= a4
| c3
^ b4
| a3
^ c2
^ b3
^ a2
^ b2
^ c1
^ a1
^ b1
^ a0
= root
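The discontinuity counts of the two listings can be checked mechanically. The parent table below is my reading of the example graph and should be treated as an assumption; `discontinuities` is a hypothetical helper that counts adjacent pairs not in a direct child-parent relationship.

```python
# Parent lists as read from the example graph (an assumption).
PARENTS = {
    "a4": ["a3", "b4", "c3"], "b4": ["a3", "b3"],
    "a3": ["a2"], "a2": ["a1"], "a1": ["a0"], "a0": ["root"],
    "c3": ["c2"], "c2": ["c1", "b2"], "c1": ["b1"],
    "b3": ["b2"], "b2": ["b1"], "b1": ["root"], "root": [],
}

def discontinuities(order, parents_of):
    """Count adjacent pairs where the second is not a parent of the first."""
    return sum(1 for child, nxt in zip(order, order[1:])
               if nxt not in parents_of.get(child, []))

merge_order = ["a4", "c3", "c2", "c1", "b4", "b3", "b2", "b1",
               "a3", "a2", "a1", "a0", "root"]
default_order = ["a4", "c3", "b4", "a3", "c2", "b3", "a2", "b2",
                 "c1", "a1", "b1", "a0", "root"]
print(discontinuities(merge_order, PARENTS))
print(discontinuities(default_order, PARENTS))
```

The counts line up with the number of "^" markers in each listing above.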
A test script, t/t6000-rev-list.sh, includes a test which demonstrates
that the linearisation produced by --merge-order has fewer discontinuities
than the linearisation produced by git-rev-list without the --merge-order
flag specified. To see this, do the following:
cd t
./t6000-rev-list.sh
cd trash
cat actual-default-order
cat actual-merge-order
The existing behaviour of git-rev-list is preserved, by default. To obtain
the modified behaviour, specify --merge-order or --merge-order --show-breaks
on the command line.
This version of the patch has been tested on the git repository and also on the linux-2.6
repository and has reasonable performance on both - ~50-100% slower than the original algorithm.
This version of the patch has incorporated a functional equivalent of Linus' output limiting
algorithm into the merge-order algorithm itself. This operates per the notes associated
with Linus' commit 337cb3fb8d.
This version has incorporated Linus' feedback regarding proposed changes to rev-list.c.
(see: [PATCH] Factor out filtering in rev-list.c)
This version has improved the way sort_first_epoch marks commits as uninteresting.
For more details about this change, refer to Documentation/git-rev-list.txt
and http://blackcubes.dyndns.org/epoch/.
Signed-off-by: Jon Seymour <jon.seymour@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
You can ask to print out "raw" format (full headers, full body),
"medium" format (author and date, full body) or "short" format
(author only, condensed body).
Use "git-rev-list --pretty=short HEAD | less -S" for an example.
This makes git-rev-list use the same command line syntax to mark the
commits as git-rev-tree does, and instead of just allowing a start and
end commit, it allows an arbitrary list of "interesting" and "uninteresting"
commits.
For example, imagine that you had three branches (a, b and c) that you
are interested in, but you don't want to see stuff that already exists
in another person's three releases (x, y and z). You can do
git-rev-list a b c ^x ^y ^z
(order doesn't matter, btw - feel free to put the uninteresting ones
first or otherwise switch them around), and it will show all the
commits that are reachable from a/b/c but not reachable from x/y/z.
The old syntax "git-rev-list start end" would now be written as
"git-rev-list start ^end", or "git-rev-list ^end start".
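The semantics boil down to a set difference over reachability, which can be sketched as follows. This is a toy model: `rev_list` returns an unordered set rather than a sorted listing, and `parents_of` is a hypothetical dict of parent lists.

```python
def reachable(tips, parents_of):
    """Return every commit reachable from the given tips."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents_of.get(commit, []))
    return seen

def rev_list(args, parents_of):
    """Commits reachable from the plain args but not from the ^args."""
    interesting = [a for a in args if not a.startswith("^")]
    uninteresting = [a[1:] for a in args if a.startswith("^")]
    return reachable(interesting, parents_of) - reachable(uninteresting, parents_of)
```

Since the arguments are sorted into two groups before anything is walked, their order on the command line makes no difference.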
There's no limit to the number of heads you can specify (unlike
git-rev-tree, which can handle a maximum of 16 heads).