The pack revindex stores the offsets of the objects in the
pack in sorted order, allowing us to easily find the on-disk
size of each object. To compute it, we populate an array
with the offsets from the sha1-sorted idx file, and then use
qsort to order it by offsets.
That does O(n log n) offset comparisons, and profiling shows
that we spend most of our time in cmp_offset. However, since
we are sorting on a simple off_t, we can use numeric sorts
that perform better. A radix sort can run in O(k*n), where k
is the number of "digits" in our number. For a 64-bit off_t,
using 16-bit "digits" gives us k=4.
On the linux.git repo, with about 3M objects to sort, this
yields a 400% speedup. Here are the best-of-five numbers for
running
echo HEAD | git cat-file --batch-check="%(objectsize:disk)
on a fully packed repository, which is dominated by time
spent building the pack revindex:
before after
real 0m0.834s 0m0.204s
user 0m0.788s 0m0.164s
sys 0m0.040s 0m0.036s
This matches our algorithmic expectations. log(3M) is ~21.5,
so a traditional sort is ~21.5n. Our radix sort runs in k*n,
where k is the number of radix digits. In the worst case,
this is k=4 for a 64-bit off_t, but we can quit early when
the largest value to be sorted is smaller. For any
repository under 4G, k=2. Our algorithm makes two passes
over the list per radix digit, so we end up with 4n. That
should yield ~5.3x speedup. We see 4x here; the difference
is probably due to the extra bucket book-keeping the radix
sort has to do.
On a smaller repo, the difference is less impressive, as
log(n) is smaller. For git.git, with 173K objects (but still
k=2), we see a 2.7x improvement:
before after
real 0m0.046s 0m0.017s
user 0m0.036s 0m0.012s
sys 0m0.008s 0m0.000s
On even tinier repos (e.g., a few hundred objects), the
speedup goes away entirely, as the small advantage of the
radix sort gets erased by the book-keeping costs (and at
those sizes, the cost to generate the the rev-index gets
lost in the noise anyway).
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Brandon Casey <drafnel@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A packfile may have up to 2^32-1 objects in it, so the
"right" data type to use is uint32_t. We currently use a
signed int, which means that we may behave incorrectly for
packfiles with more than 2^31-1 objects on 32-bit systems.
Nobody has noticed because having 2^31 objects is pretty
insane. The linux.git repo has on the order of 2^22 objects,
which is hundreds of times smaller than necessary to trigger
the bug.
Let's bump this up to an "unsigned". On 32-bit systems, this
gives us the correct data-type, and on 64-bit systems, it is
probably more efficient to use the native "unsigned" than a
true uint32_t.
While we're at it, we can fix the binary search not to
overflow in such a case if our unsigned is 32 bits.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It currently calls die() whenever given offset is not found thinking
that such thing should never happen. But this offset may come from a
corrupted pack whych _could_ happen and not be found. Callers should
deal with this possibility gracefully instead.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is needed to fix verify-pack -v with multiple pack arguments.
Also, in theory, revindex data (if any) must be discarded whenever
reprepare_packed_git() is called. In practice this is hard to trigger
though.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This makes life much easier for next patch, as well as being more efficient
when the revindex is actually not used.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
No functional change. This is needed to fix verify-pack in a later patch.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>