mirror of
https://github.com/git/git.git
synced 2024-10-31 06:17:56 +01:00
graph: add commit graph design document
Add Documentation/technical/commit-graph.txt with details of the planned commit graph feature, including future plans. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit is contained in:
parent
b84f767c8a
commit
ae30d7b115
1 changed files with 163 additions and 0 deletions
163
Documentation/technical/commit-graph.txt
Normal file
163
Documentation/technical/commit-graph.txt
Normal file
|
@ -0,0 +1,163 @@
|
|||
Git Commit Graph Design Notes
|
||||
=============================
|
||||
|
||||
Git walks the commit graph for many reasons, including:
|
||||
|
||||
1. Listing and filtering commit history.
|
||||
2. Computing merge bases.
|
||||
|
||||
These operations can become slow as the commit count grows. The merge
|
||||
base calculation shows up in many user-facing commands, such as 'merge-base'
|
||||
or 'status' and can take minutes to compute depending on history shape.
|
||||
|
||||
There are two main costs here:
|
||||
|
||||
1. Decompressing and parsing commits.
|
||||
2. Walking the entire graph to satisfy topological order constraints.
|
||||
|
||||
The commit graph file is a supplemental data structure that accelerates
|
||||
commit graph walks. If a user downgrades or disables the 'core.commitGraph'
|
||||
config setting, then the existing ODB is sufficient. The file is stored
|
||||
as "commit-graph" either in the .git/objects/info directory or in the info
|
||||
directory of an alternate.
|
||||
|
||||
The commit graph file stores the commit graph structure along with some
|
||||
extra metadata to speed up graph walks. By listing commit OIDs in lexi-
|
||||
cographic order, we can identify an integer position for each commit and
|
||||
refer to the parents of a commit using those integer positions. We use
|
||||
binary search to find initial commits and then use the integer positions
|
||||
for fast lookups during the walk.
|
||||
|
||||
A consumer may load the following info for a commit from the graph:
|
||||
|
||||
1. The commit OID.
|
||||
2. The list of parents, along with their integer position.
|
||||
3. The commit date.
|
||||
4. The root tree OID.
|
||||
5. The generation number (see definition below).
|
||||
|
||||
Values 1-4 satisfy the requirements of parse_commit_gently().
|
||||
|
||||
Define the "generation number" of a commit recursively as follows:
|
||||
|
||||
* A commit with no parents (a root commit) has generation number one.
|
||||
|
||||
* A commit with at least one parent has generation number one more than
|
||||
the largest generation number among its parents.
|
||||
|
||||
Equivalently, the generation number of a commit A is one more than the
|
||||
length of a longest path from A to a root commit. The recursive definition
|
||||
is easier to use for computation and observing the following property:
|
||||
|
||||
If A and B are commits with generation numbers N and M, respectively,
|
||||
and N <= M, then A cannot reach B. That is, we know without searching
|
||||
that B is not an ancestor of A because it is further from a root commit
|
||||
than A.
|
||||
|
||||
Conversely, when checking if A is an ancestor of B, then we only need
|
||||
to walk commits until all commits on the walk boundary have generation
|
||||
number at most N. If we walk commits using a priority queue seeded by
|
||||
generation numbers, then we always expand the boundary commit with highest
|
||||
generation number and can easily detect the stopping condition.
|
||||
|
||||
This property can be used to significantly reduce the time it takes to
|
||||
walk commits and determine topological relationships. Without generation
|
||||
numbers, the general heuristic is the following:
|
||||
|
||||
If A and B are commits with commit time X and Y, respectively, and
|
||||
X < Y, then A _probably_ cannot reach B.
|
||||
|
||||
This heuristic is currently used whenever the computation is allowed to
|
||||
violate topological relationships due to clock skew (such as "git log"
|
||||
with default order), but is not used when the topological order is
|
||||
required (such as merge base calculations, "git log --graph").
|
||||
|
||||
In practice, we expect some commits to be created recently and not stored
|
||||
in the commit graph. We can treat these commits as having "infinite"
|
||||
generation number and walk until reaching commits with known generation
|
||||
number.
|
||||
|
||||
Design Details
|
||||
--------------
|
||||
|
||||
- The commit graph file is stored in a file named 'commit-graph' in the
|
||||
.git/objects/info directory. This could be stored in the info directory
|
||||
of an alternate.
|
||||
|
||||
- The core.commitGraph config setting must be on to consume graph files.
|
||||
|
||||
- The file format includes parameters for the object ID hash function,
|
||||
so a future change of hash algorithm does not require a change in format.
|
||||
|
||||
Future Work
|
||||
-----------
|
||||
|
||||
- The commit graph feature currently does not honor commit grafts. This can
|
||||
be remedied by duplicating or refactoring the current graft logic.
|
||||
|
||||
- The 'commit-graph' subcommand does not have a "verify" mode that is
|
||||
necessary for integration with fsck.
|
||||
|
||||
- The file format includes room for precomputed generation numbers. These
|
||||
are not currently computed, so all generation numbers will be marked as
|
||||
0 (or "uncomputed"). A later patch will include this calculation.
|
||||
|
||||
- After computing and storing generation numbers, we must make graph
|
||||
walks aware of generation numbers to gain the performance benefits they
|
||||
enable. This will mostly be accomplished by swapping a commit-date-ordered
|
||||
priority queue with one ordered by generation number. The following
|
||||
operations are important candidates:
|
||||
|
||||
- paint_down_to_common()
|
||||
- 'log --topo-order'
|
||||
|
||||
- Currently, parse_commit_gently() requires filling in the root tree
|
||||
object for a commit. This passes through lookup_tree() and consequently
|
||||
lookup_object(). Also, it calls lookup_commit() when loading the parents.
|
||||
These method calls check the ODB for object existence, even if the
|
||||
consumer does not need the content. For example, we do not need the
|
||||
tree contents when computing merge bases. Now that commit parsing is
|
||||
removed from the computation time, these lookup operations are the
|
||||
slowest operations keeping graph walks from being fast. Consider
|
||||
loading these objects without verifying their existence in the ODB and
|
||||
only loading them fully when consumers need them. Consider a method
|
||||
such as "ensure_tree_loaded(commit)" that fully loads a tree before
|
||||
using commit->tree.
|
||||
|
||||
- The current design uses the 'commit-graph' subcommand to generate the graph.
|
||||
When this feature stabilizes enough to recommend to most users, we should
|
||||
add automatic graph writes to common operations that create many commits.
|
||||
For example, one could compute a graph on 'clone', 'fetch', or 'repack'
|
||||
commands.
|
||||
|
||||
- A server could provide a commit graph file as part of the network protocol
|
||||
to avoid extra calculations by clients. This feature is only of benefit if
|
||||
the user is willing to trust the file, because verifying the file is correct
|
||||
is as hard as computing it from scratch.
|
||||
|
||||
Related Links
|
||||
-------------
|
||||
[0] https://bugs.chromium.org/p/git/issues/detail?id=8
|
||||
Chromium work item for: Serialized Commit Graph
|
||||
|
||||
[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
|
||||
An abandoned patch that introduced generation numbers.
|
||||
|
||||
[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
|
||||
Discussion about generation numbers on commits and how they interact
|
||||
with fsck.
|
||||
|
||||
[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
|
||||
More discussion about generation numbers and not storing them inside
|
||||
commit objects. A valuable quote:
|
||||
|
||||
"I think we should be moving more in the direction of keeping
|
||||
repo-local caches for optimizations. Reachability bitmaps have been
|
||||
a big performance win. I think we should be doing the same with our
|
||||
properties of commits. Not just generation numbers, but making it
|
||||
cheap to access the graph structure without zlib-inflating whole
|
||||
commit objects (i.e., packv4 or something like the "metapacks" I
|
||||
proposed a few years ago)."
|
||||
|
||||
[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
|
||||
A patch to remove the ahead-behind calculation from 'status'.
|
Loading…
Reference in a new issue