When traversing commits, the selection of commits would heed the list of
pathspecs passed, but subsequent walking of the trees of those commits
would not. This resulted in 'rev-list --objects HEAD -- <paths>'
displaying objects at unwanted paths.
Have process_tree() call tree_entry_interesting() to determine which paths
are interesting and should be walked.
Naturally, this change can provide a large speedup when paths are specified
together with --objects, since many tree entries are now correctly ignored.
Interestingly, though, this change also gives me a small (~1%) but
repeatable speedup even when no paths are specified with --objects.
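As a rough illustration of the idea (not the actual patch), the sketch below prunes a toy tree walk by pathspec. The types and the helper toy_entry_interesting() are hypothetical stand-ins for git's tree entries and tree_entry_interesting(), and pathspecs are assumed to be plain directory or file names without trailing slashes or wildcards.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy tree entry; NOT git's real tree representation. */
struct toy_entry {
	const char *name;
	const struct toy_entry *children; /* NULL for blobs */
	size_t nr_children;
};

/*
 * Hypothetical stand-in for tree_entry_interesting(): a path is
 * interesting if it equals or lies below a pathspec, or if it is a
 * leading directory of one (so the walk can descend towards it).
 */
int toy_entry_interesting(const char *path, const char *const *specs, size_t nr)
{
	size_t i, plen = strlen(path);

	if (!nr)
		return 1; /* no pathspecs: everything is interesting */
	for (i = 0; i < nr; i++) {
		size_t slen = strlen(specs[i]);

		if (plen >= slen) {
			if (!strncmp(path, specs[i], slen) &&
			    (path[slen] == '\0' || path[slen] == '/'))
				return 1;
		} else if (!strncmp(path, specs[i], plen) &&
			   specs[i][plen] == '/') {
			return 1;
		}
	}
	return 0;
}

/* Walk the entries, skipping uninteresting ones; returns how many were shown. */
int toy_process_tree(const struct toy_entry *entries, size_t nr,
		     const char *base,
		     const char *const *specs, size_t nr_specs)
{
	char path[256]; /* long paths are truncated in this sketch */
	size_t i;
	int shown = 0;

	for (i = 0; i < nr; i++) {
		snprintf(path, sizeof(path), "%s%s%s",
			 base, *base ? "/" : "", entries[i].name);
		if (!toy_entry_interesting(path, specs, nr_specs))
			continue; /* pruned: neither shown nor descended into */
		shown++;
		if (entries[i].children)
			shown += toy_process_tree(entries[i].children,
						  entries[i].nr_children,
						  path, specs, nr_specs);
	}
	return shown;
}
```

With a pathspec of "src", a walk over a tree holding "Documentation/", "src/" and "Makefile" would visit only "src" and what lies below it; the other entries are skipped without being opened.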
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* lt/pack-object-memuse:
  show_object(): push path_name() call further down
  process_{tree,blob}: show objects without buffering
Conflicts:
    builtin-pack-objects.c
    builtin-rev-list.c
    list-objects.c
    list-objects.h
    upload-pack.c
In particular, pushing the "path_name()" call _into_ the show() function
would seem to allow
- more clarity into who "owns" the name (ie now when we free the name in
the show_object callback, it's because we generated it ourselves by
calling path_name())
- not calling path_name() at all, either because we don't care about the
name in the first place, or because we are actually happy walking the
linked list of "struct name_path *" and the last component.
Now, I didn't do that latter optimization, because it would require some
more coding, but especially looking at "builtin-pack-objects.c", we really
don't even want the whole pathname, we really would be better off with the
list of path components.
Why? We use that name for two things:
- add_preferred_base_object(), which actually _wants_ to traverse the
path, and now does it by looking for '/' characters!
- for 'name_hash()', which only cares about the last 16 characters of a
name, so again, generating the full name seems to be just unnecessary
work.
Anyway, so I didn't look any closer at those things, but it did convince
me that the "show_object()" calling convention was crazy, and we're
actually better off doing _less_ in list-objects.c, and giving people
access to the internal data structures so that they can decide whether
they want to generate a path-name or not.
This patch does that, and then for people who did use the name (even if
they might do something more clever in the future), it just does the
straightforward "name = path_name(path, component); .. free(name);" thing.
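To make the calling convention concrete, here is a simplified sketch, not the code from this patch. The struct layout and toy_path_name() are assumptions modeled on the description above; malloc() stands in for git's allocation wrappers, and the two callbacks are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* One path component, linked towards the root of the tree. */
struct name_path {
	const struct name_path *up;
	size_t elem_len;
	const char *elem;
};

/* Build "dir/sub/name" from the component list; the caller frees the result. */
char *toy_path_name(const struct name_path *path, const char *name)
{
	const struct name_path *p;
	size_t nlen = strlen(name);
	size_t len = nlen + 1;
	char *n, *m;

	for (p = path; p; p = p->up)
		if (p->elem_len)
			len += p->elem_len + 1; /* component plus '/' */
	n = malloc(len);
	if (!n)
		return NULL;
	/* Fill the buffer from the end: last component first. */
	m = n + len - (nlen + 1);
	memcpy(m, name, nlen + 1);
	for (p = path; p; p = p->up) {
		if (p->elem_len) {
			m -= p->elem_len + 1;
			memcpy(m, p->elem, p->elem_len);
			m[p->elem_len] = '/';
		}
	}
	return n;
}

/* A callback that wants the full name generates it and frees it itself. */
size_t show_object_with_name(const struct name_path *path, const char *component)
{
	char *name = toy_path_name(path, component);
	size_t len = name ? strlen(name) : 0;

	/* ... the full name would be used here ... */
	free(name);
	return len;
}

/* A callback that only needs the last component never builds the name. */
size_t show_object_last_component(const struct name_path *path, const char *component)
{
	(void)path; /* the component list is available but unused here */
	return strlen(component);
}
```

The point of the sketch is ownership: whoever calls the name-building helper also frees the result, and a callback that does not need the full name pays nothing for it.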
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Here's a less trivial thing, and slightly more dubious one.
I was looking at that "struct object_array objects", and wondering why we
do that. I have honestly totally forgotten. Why not just call the "show()"
function as we encounter the objects? Rather than add the objects to the
object_array, and then at the very end going through the array and doing a
'show' on all, just do things more incrementally.
Now, there are possible downsides to this:
- the "buffer using object_array" _can_ in theory result in at least
better I-cache usage (two tight loops rather than one more spread out
one). I don't think this is a real issue, but in theory..
- this _does_ change the order of the objects printed. Instead of doing a
"process_tree(revs, commit->tree, &objects, NULL, "");" in the loop
over the commits (which puts all the root trees _first_ in the object
list), this patch just adds them to the list of pending objects, and
then we'll traverse them in that order (and thus show each root tree
object together with the objects we discover under it).
I _think_ the new ordering actually makes more sense, but the object
ordering is actually a subtle thing when it comes to packing
efficiency, so any change in order is going to have implications for
packing. Good or bad, I dunno.
- There may be some reason why we did it that odd way with the object
array, that I have simply forgotten.
Anyway, now that we don't buffer up the objects before showing them,
that may actually result in lower memory usage during that whole
traverse_commit_list() phase.
This is seriously not very deeply tested. It makes sense to me, it seems
to pass all the tests, it looks ok, but...
Does anybody remember why we did that "object_array" thing? It used to be
an "object_list" a long long time ago, but got changed into the array due
to better memory usage patterns (those linked lists of objects are
horrible from a memory allocation standpoint). But I wonder why we didn't
do this back then. Maybe there's a reason for it.
Or maybe there _used_ to be a reason, and no longer is.
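The difference between the two styles can be sketched roughly as follows. This is a toy model with integer object ids and hypothetical function names, not the list-objects.c code; it only contrasts "buffer, then show" with "show as encountered" and does not model the change in output order discussed above.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void (*show_fn)(int object_id);

size_t nr_shown; /* updated by the example callback below */

/* Example callback: just counts what it is shown. */
void show_and_count(int object_id)
{
	(void)object_id;
	nr_shown++;
}

/* Old style: buffer every object, then show them all in a second loop. */
size_t traverse_buffered(const int *ids, size_t nr, show_fn show)
{
	int *buf;
	size_t i;

	if (!nr)
		return 0;
	buf = malloc(nr * sizeof(*buf));
	if (!buf)
		abort(); /* a real implementation would report the error */
	for (i = 0; i < nr; i++)
		buf[i] = ids[i]; /* phase 1: discover and buffer */
	for (i = 0; i < nr; i++)
		show(buf[i]);    /* phase 2: show everything at the end */
	free(buf);
	return nr; /* peak number of buffered objects */
}

/* New style: show each object as soon as it is encountered. */
size_t traverse_incremental(const int *ids, size_t nr, show_fn show)
{
	size_t i;

	for (i = 0; i < nr; i++)
		show(ids[i]);
	return 0; /* nothing is ever buffered */
}
```

Both variants show the same objects; the return value is the peak number of buffered entries, which is where the possible memory saving would come from.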
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* cc/bisect-filter: (21 commits)
  rev-list: add "int bisect_show_flags" in "struct rev_list_info"
  rev-list: remove last static vars used in "show_commit"
  list-objects: add "void *data" parameter to show functions
  bisect--helper: string output variables together with "&&"
  rev-list: pass "int flags" as last argument of "show_bisect_vars"
  t6030: test bisecting with paths
  bisect: use "bisect--helper" and remove "filter_skipped" function
  bisect: implement "read_bisect_paths" to read paths in "$GIT_DIR/BISECT_NAMES"
  bisect--helper: implement "git bisect--helper"
  bisect: use the new generic "sha1_pos" function to lookup sha1
  rev-list: call new "filter_skip" function
  patch-ids: use the new generic "sha1_pos" function to lookup sha1
  sha1-lookup: add new "sha1_pos" function to efficiently lookup sha1
  rev-list: pass "revs" to "show_bisect_vars"
  rev-list: make "show_bisect_vars" non static
  rev-list: move code to show bisect vars into its own function
  rev-list: move bisect related code into its own file
  rev-list: make "bisect_list" variable local to "cmd_rev_list"
  refs: add "for_each_ref_in" function to refactor "for_each_*_ref" functions
  quote: add "sq_dequote_to_argv" to put unwrapped args in an argv array
  ...
The name of the processed object was duplicated for passing it to
add_object(), but that already calls path_name, which allocates a new
string anyway. So the memory allocated by the xstrdup calls just went
nowhere, leaking memory.
This reduces the RSS usage for a "rev-list --all --objects" by about 10% on
the gentoo repo (fully packed) as well as linux-2.6.git:
gentoo:
                 | old     | new
-----------------|---------|---------
RSS              | 1537284 | 1388408
VSZ              | 1816852 | 1667952
time elapsed     | 1:49.62 | 1:48.99
min. page faults |  417178 |  379919
linux-2.6.git:
                 | old     | new
-----------------|---------|---------
RSS              |  324452 |  292996
VSZ              |  491792 |  460376
time elapsed     | 0:14.53 | 0:14.28
min. page faults |   89360 |   81613
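The leak pattern described above can be sketched like this. Everything here is a toy: toy_add_object() stands in for a callee that allocates its own copy of the name, the counting wrappers are hypothetical, and the "lost" out-parameter exists only so the demonstration can release the string that the leaky code would have abandoned.

```c
#include <stdlib.h>
#include <string.h>

long live_strings; /* strings allocated and not yet freed */

char *counted_strdup(const char *s)
{
	size_t n = strlen(s) + 1;
	char *copy = malloc(n);

	if (!copy)
		abort();
	memcpy(copy, s, n);
	live_strings++;
	return copy;
}

void counted_free(char *s)
{
	if (s) {
		free(s);
		live_strings--;
	}
}

/* Stand-in for the callee: it allocates its own string from the name. */
char *toy_add_object(const char *name)
{
	return counted_strdup(name);
}

/*
 * Before: the caller duplicates the name even though the callee makes
 * its own copy. In the leaky code nobody kept a pointer to the
 * duplicate; *lost reports it here only so the demo can free it.
 */
char *process_before(const char *name, char **lost)
{
	*lost = counted_strdup(name);
	return toy_add_object(*lost);
}

/* After: pass the name straight through; only the callee allocates. */
char *process_after(const char *name)
{
	return toy_add_object(name);
}
```

After process_before() and a free of its result, one string is still live per processed object; with process_after() the count returns to zero.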
Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The goal of this patch is to get rid of the "static struct rev_info
revs" static variable in "builtin-rev-list.c".
To do that, we need to pass the revs to the "show_commit" function
in "builtin-rev-list.c", and this in turn means that the
"traverse_commit_list" function in "list-objects.c" must be passed
pointers to functions with two parameters instead of one.
So we have to change all the callers and all the functions passed
to "traverse_commit_list".
Anyway, this makes the code cleaner and more generic, so it
should be a good thing in the long run.
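A minimal sketch of the pattern, assuming toy types and names (none of these are the real signatures from list-objects.h): the traversal takes an opaque "void *data" and hands it to every callback, so the callback no longer needs a file-scope static variable.

```c
#include <stddef.h>

struct toy_commit {
	int id;
};

/* Per-invocation state that previously would have lived in a static. */
struct toy_rev_list_info {
	size_t nr_shown;
	int last_id;
};

/* Callbacks now take two parameters: the commit and the caller's data. */
typedef void (*toy_show_commit_fn)(const struct toy_commit *commit, void *data);

void toy_traverse_commit_list(const struct toy_commit *commits, size_t nr,
			      toy_show_commit_fn show_commit, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++)
		show_commit(&commits[i], data); /* data is passed through untouched */
}

/* Example callback: recovers its state from the opaque pointer. */
void toy_show_commit(const struct toy_commit *commit, void *data)
{
	struct toy_rev_list_info *info = data;

	info->nr_shown++;
	info->last_id = commit->id;
}
```

A caller would declare a struct toy_rev_list_info on its own stack and pass its address as the last argument, so two traversals in the same process cannot trample each other's state.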
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As these functions are directly called with the result
from lookup_tree/blob, they must handle NULL.
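As a small sketch of the guard, assuming toy types: toy_lookup_blob() is a hypothetical lookup that can return NULL, and returning -1 is a choice made for this example only; how the real functions react to NULL is not shown here.

```c
#include <stddef.h>

#define TOY_SEEN 1u

struct toy_blob {
	unsigned flags;
};

/* Hypothetical lookup: yields NULL when the object is not a blob. */
struct toy_blob *toy_lookup_blob(struct toy_blob *candidate, int is_blob)
{
	return is_blob ? candidate : NULL;
}

/* Returns 1 if newly processed, 0 if already seen, -1 if given NULL. */
int toy_process_blob(struct toy_blob *blob)
{
	if (!blob)
		return -1; /* must be checked before any dereference */
	if (blob->flags & TOY_SEEN)
		return 0;
	blob->flags |= TOY_SEEN;
	return 1;
}
```

Because the lookup result is passed directly as an argument, the check has to live inside the processing function rather than at each call site.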
Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If we were listing objects too, then the objects were buffered in an
array only reachable from a stack-allocated structure. When this
function returned, that array would be leaked, as nobody would have
a reference to it anymore.
Historically this hasn't been a problem as the primary user of
traverse_commit_list() (the noble git-rev-list) would terminate
as soon as the function was finished, thus allowing the operating
system to clean up memory. However, we have been leaking this data
in git-pack-objects ever since that program learned how to run the
revision listing internally, rather than relying on reading object
names from git-rev-list.
To better facilitate reuse of traverse_commit_list in other
builtin tools (such as git-fetch), we shouldn't leak temporary memory
like this; instead we need to clean up properly after ourselves.
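The cleanup can be sketched as below. The growable array and its helpers are toy stand-ins with hypothetical names, not git's object_array API; the sketch only shows that heap storage hanging off a stack-allocated struct has to be released before the function returns.

```c
#include <stdlib.h>

struct toy_object_array {
	size_t nr, alloc;
	int *items;
};

void toy_array_add(struct toy_object_array *a, int id)
{
	if (a->nr == a->alloc) {
		size_t alloc = a->alloc ? 2 * a->alloc : 8;
		int *items = realloc(a->items, alloc * sizeof(*items));

		if (!items)
			abort(); /* a real implementation would report the error */
		a->items = items;
		a->alloc = alloc;
	}
	a->items[a->nr++] = id;
}

void toy_array_clear(struct toy_object_array *a)
{
	free(a->items);
	a->items = NULL;
	a->nr = a->alloc = 0;
}

/* Buffers ids in a stack-allocated array and releases it before returning. */
size_t toy_traverse(const int *ids, size_t nr)
{
	struct toy_object_array objects = { 0, 0, NULL };
	size_t i, shown;

	for (i = 0; i < nr; i++)
		toy_array_add(&objects, ids[i]);
	shown = objects.nr;
	/* Without this, items would be unreachable once we return. */
	toy_array_clear(&objects);
	return shown;
}
```

A short-lived process can get away without the clear call because exit reclaims everything; a long-running caller that traverses repeatedly cannot.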
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This allows us to pack superprojects and thus clone them (but not yet
check them out on the receiving side; that's the next patch).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This removes slightly more lines than it adds, but the real reason for
doing this is that future optimizations will require more setup of the
tree descriptor, and so we want to do it in one place.
This also renames the "desc.buf" field to "desc.buffer" just to trigger
compiler errors for old-style manual initializations, making sure I
didn't miss anything.
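Roughly, the idea looks like the following sketch. The struct and function carry a toy_ prefix because they are illustrative stand-ins, not the declarations from this patch.

```c
/* Toy descriptor; the real one may carry more state. */
struct toy_tree_desc {
	const void *buffer; /* renamed from "buf": stale "desc.buf = ..." no longer compiles */
	unsigned long size;
};

/* Single place where a descriptor is set up. */
void toy_init_tree_desc(struct toy_tree_desc *desc,
			const void *buffer, unsigned long size)
{
	desc->buffer = buffer;
	desc->size = size;
	/* any future per-descriptor setup would be added here, once */
}
```

Callers that used to assign the two fields by hand call the initializer instead, and any call site that was missed fails to build because of the rename.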
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This teaches the internal rev-list logic to understand options
that are needed for pack handling: --all, --unpacked, and --thin.
It also moves two functions from builtin-rev-list to list-objects
so that the two programs can share more code.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Create a separate file, list-objects.c, and move object listing
routines from rev-list to it. The next round will use it in
pack-objects directly.
Signed-off-by: Junio C Hamano <junkio@cox.net>