Trivially shrinks the on-disk size of the index file to save both I/O and
checksum overhead.
The topic should give a solid base to build on further updates, with the
code refactoring in its earlier parts, and the backward compatibility
mechanism in its later parts.
* jc/index-v4:
index-v4: document the entry format
unpack-trees: preserve the index file version of original
update-index: upgrade/downgrade on-disk index version
read-cache.c: write prefix-compressed names in the index
read-cache.c: read prefix-compressed names in index on-disk version v4
read-cache.c: move code to copy incore to ondisk cache to a helper function
read-cache.c: move code to copy ondisk to incore cache to a helper function
read-cache.c: report the header version we do not understand
read-cache.c: make create_from_disk() report number of bytes it consumed
read-cache.c: allow unaligned mapping of the index file
cache.h: hide on-disk index details
varint: make it available outside the context of pack
Teach the code to write the index in the v4 on-disk format.
Record the format version of the on-disk index we read from in the
index_state, and use the format when writing the new index out.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Because the entries are sorted by path, adjacent entries in the index tend
to share the leading components of them, and it makes sense to only store
the differences in later entries. In the v4 on-disk format of the index,
each on-disk cache entry stores the number of bytes to be stripped from
the end of the previous name, and the bytes to append to the result, to
come up with its name.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function is the one that is reading from the data stream. It only is
natural to make it responsible for reporting this number, not the caller.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both the on-disk format v2 and v3 pads the "name" field to the multiple of
eight to make sure that various quantities in network long/short type can
be accessed with ntohl/ntohs without having to worry about alignment, but
this forces us to waste disk I/O bandwidth.
Introduce ntoh_s()/ntoh_l() macros that the callers can use as if they were
the regular ntohs()/ntohl() on a field that may not be aligned correctly.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The on-disk format of the index file is a detail whose implementation is
neatly encapsulated in read-cache.c; there is no need to expose it to the
general public that include the cache.h header file.
Also add a prominent mark to read-cache.c to delineate the parts that deal
with the index file I/O routines from the remainder of the file.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The read-cache implementation defines this static function,
but it is a generally useful concept in git. Let's give
the empty blob the same treatment as the empty tree,
providing both hex and binary forms of the sha1.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When running "git add --refresh <pathspec>", we incorrectly showed the
path that is unmerged even if it is outside the specified pathspec, even
though we did honor pathspec and refreshed only the paths that matched.
Note that this cange does not affect "git update-index --refresh"; for
hysterical raisins, it does not take a pathspec (it takes real paths) and
more importantly itss command line options are parsed and executed one by
one as they are encountered, so "git update-index --refresh foo" means
"first refresh the index, and then update the entry 'foo' by hashing the
contents in file 'foo'", not "refresh only entry 'foo'".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* rs/allocate-cache-entry-individually:
cache.h: put single NUL at end of struct cache_entry
read-cache.c: allocate index entries individually
Conflicts:
read-cache.c
If you have a deleted file and a porcelain refreshes the
cache, we print:
Unstaged changes after reset:
M file
This is technically correct, in that the file is modified,
but it's friendlier to the user if we further differentiate
the case of a deleted file (especially because this output
looks a lot like "diff --name-status", which would also make
the distinction).
Similarly, we can distinguish typechanges ("T") and
intent-to-add files ("A"), both of which appear as just "M"
in the current output.
The plumbing output for all cases remains "needs update" for
historical compatibility.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When refreshing the index, for modified (or unmerged) files we will print
"needs update" (or "needs merge") for plumbing, or line similar to the
output from "diff --name-status" for porcelain.
The variables holding which type of message to show are named after the
plumbing messages. However, as we begin to differentiate more cases at the
porcelain level (with the plumbing message staying the same), that naming
scheme will become awkward.
Instead, name the variables after which case we found (modified or
unmerged), not what we will output.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This will enable refresh_cache to differentiate more cases
of modification (such as typechange) when telling the user
what isn't fresh.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to estimate the in-memory size of the index based on its on-disk
representation is subtly wrong for certain architecture-dependent struct
layouts. Instead of fixing it, replace the code to keep the index entries
in a single large block of memory and allocate each entry separately
instead. This is both simpler and more flexible, as individual entries
can now be freed. Actually using that added flexibility is left for a
later patch.
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
estimate_cache_size() tries to guess how much memory is needed for the
in-memory representation of an index file. It does that by using the
file size, the number of entries and the difference of the sizes of the
on-disk and in-memory structs -- without having to check the length of
the name of each entry, which varies for each entry, but their sums are
the same no matter the representation.
Except there can be a difference. First of all, the size is really
calculated by ce_size and ondisk_ce_size based on offsetof(..., name),
not sizeof, which can be different. And entries are padded with 1 to 8
NULs at the end (after the variable name) to make their total length a
multiple of eight.
So in order to allocate enough memory to hold the index, change the
delta calculation to be based on offsetof(..., name) and round up to
the next multiple of eight.
On a 32-bit Linux, this delta was used before:
sizeof(struct cache_entry) == 72
sizeof(struct ondisk_cache_entry) == 64
---
8
The actual difference for an entry with a filename length of one was,
however (find the definitions are in cache.h):
offsetof(struct cache_entry, name) == 72
offsetof(struct ondisk_cache_entry, name) == 62
ce_size == (72 + 1 + 8) & ~7 == 80
ondisk_ce_size == (62 + 1 + 8) & ~7 == 64
---
16
So eight bytes less had been allocated for such entries. The new
formula yields the correct delta:
(72 - 62 + 7) & ~7 == 16
Reported-by: John Hsing <tsyj2007@gmail.com>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ef/maint-win-verify-path:
verify_dotfile(): do not assume '/' is the path seperator
verify_path(): simplify check at the directory boundary
verify_path: consider dos drive prefix
real_path: do not assume '/' is the path seperator
A Windows path starting with a backslash is absolute
verify_dotfile() currently assumes that the path seperator is '/', but on
Windows it can also be '\\', so use is_dir_sep() instead.
Signed-off-by: Theo Niessink <theo@taletn.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We simply want to say "At a directory boundary, be careful with a name
that begins with a dot, forbid a name that ends with the boundary
character or has duplicated bounadry characters".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If someone manage to create a repo with a 'C:' entry in the
root-tree, files can be written outside of the working-dir. This
opens up a can-of-worms of exploits.
Fix it by explicitly checking for a dos drive prefix when verifying
a paht. While we're at it, make sure that paths beginning with '\' is
considered absolute as well.
Noticed-by: Theo Niessink <theo@taletn.com>
Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "format_check" parameter tucked after the existing parameters is too
ugly an afterthought to live in any reasonable API.
Combine it with the other boolean parameter "write_object" into a single
"flags" parameter.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Traditional "opportunistic index update" done by read-only "diff" and
"status" was about updating cached lstat(2) information in the index for
the next round. We missed another obvious optimization opportunity: when
there are racily clean entries that will cease to be racily clean by
updating $GIT_INDEX_FILE. Detect that case and write $GIT_INDEX_FILE out
to give it a newer timestamp.
Noticed by Lasse Makholm by stracing "git status" in a fresh checkout and
counting the number of open(2) calls.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we had to refresh the index internally before running diff or status,
we opportunistically updated the $GIT_INDEX_FILE so that later invocation
of git can use the lstat(2) we already did in this invocation.
Make them share a helper function to do so.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* nd/struct-pathspec: (22 commits)
t6004: add pathspec globbing test for log family
t7810: overlapping pathspecs and depth limit
grep: drop pathspec_matches() in favor of tree_entry_interesting()
grep: use writable strbuf from caller for grep_tree()
grep: use match_pathspec_depth() for cache/worktree grepping
grep: convert to use struct pathspec
Convert ce_path_match() to use match_pathspec_depth()
Convert ce_path_match() to use struct pathspec
struct rev_info: convert prune_data to struct pathspec
pathspec: add match_pathspec_depth()
tree_entry_interesting(): optimize wildcard matching when base is matched
tree_entry_interesting(): support wildcard matching
tree_entry_interesting(): fix depth limit with overlapping pathspecs
tree_entry_interesting(): support depth limit
tree_entry_interesting(): refactor into separate smaller functions
diff-tree: convert base+baselen to writable strbuf
glossary: define pathspec
Move tree_entry_interesting() to tree-walk.c and export it
tree_entry_interesting(): remove dependency on struct diff_options
Convert struct diff_options to use struct pathspec
...
Commits, trees and tags have structure. Don't let users feed git
with malformed ones. Sooner or later git will die() when
encountering them.
Note that this patch does not check semantics. A tree that points
to non-existent objects is perfectly OK (and should be so, users
may choose to add commit first, then its associated tree for example).
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jj/icase-directory:
Support case folding in git fast-import when core.ignorecase=true
Support case folding for git add when core.ignorecase=true
Add case insensitivity support when using git ls-files
Add case insensitivity support for directories when using git status
Case insensitivity support for .gitignore via core.ignorecase
Add string comparison functions that respect the ignore_case variable.
Makefile & configure: add a NO_FNMATCH_CASEFOLD flag
Makefile & configure: add a NO_FNMATCH flag
Conflicts:
Makefile
config.mak.in
configure.ac
fast-import.c
When MyDir/ABC/filea.txt is added to Git, the disk directory MyDir/ABC/
is renamed to mydir/aBc/, and then mydir/aBc/fileb.txt is added, the
index will contain MyDir/ABC/filea.txt and mydir/aBc/fileb.txt. Although
the earlier portions of this patch series account for those differences
in case, this patch makes the pathing consistent by folding the case of
newly added files against the first file added with that path.
In read-cache.c's add_to_index(), the index_name_exists() support used
for git status's case insensitive directory lookups is used to find the
proper directory case according to what the user already checked in.
That is, MyDir/ABC/'s case is used to alter the stored path for
fileb.txt to MyDir/ABC/fileb.txt (instead of mydir/aBc/fileb.txt).
This is especially important when cloning a repository to a case
sensitive file system. MyDir/ABC/ and mydir/aBc/ exist in the same
directory on a Windows machine, but on Linux, the files exist in two
separate directories. The update to add_to_index(), in effect, treats a
Windows file system as case sensitive by making path case consistent.
Signed-off-by: Joshua Jensen <jjensen@workspacewhiz.com>
Signed-off-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The new dircache extension CACHE_EXT_RESOLVE_UNDO, whose value is
0x52455543, is actually the ASCII sequence 'REUC', not the ASCII
sequence 'REUN'.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The rule has always been that a cache entry that is ce_uptodate(ce)
means that we already have checked the work tree entity and we know
there is no change in the work tree compared to the index, and nobody
should have to double check. Note that false ce_uptodate(ce) does not
mean it is known to be dirty---it only means we don't know if it is
clean.
There are a few codepaths (refresh-index and preload-index are among
them) that mark a cache entry as up-to-date based solely on the return
value from ie_match_stat(); this function uses lstat() to see if the
work tree entity has been touched, and for a submodule entry, if its
HEAD points at the same commit as the commit recorded in the index of
the superproject (a submodule that is not even cloned is considered
clean).
A submodule is no longer considered unmodified merely because its HEAD
matches the index of the superproject these days, in order to prevent
people from forgetting to commit in the submodule and updating the
superproject index with the new submodule commit, before commiting the
state in the superproject. However, the patch to do so didn't update
the codepath that marks cache entries up-to-date based on the updated
definition and instead worked it around by saying "we don't trust the
return value of ce_uptodate() for submodules."
This makes ce_uptodate() trustworthy again by not marking submodule
entries up-to-date.
The next step _could_ be to introduce a few "in-core" flag bits to
cache_entry structure to record "this entry is _known_ to be dirty",
call is_submodule_modified() from ie_match_stat(), and use these new
bits to avoid running this rather expensive check more than once, but
that can be a separate patch.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Exal Sibeaz pointed out that some git files are way too big, and that
add_files_to_cache() brings in all the diff machinery to any git binary
that needs the basic git SHA1 object operations from read-cache.c. Which
is pretty much all of them.
It's doubly silly, since add_files_to_cache() is only used by builtin
programs (add, checkout and commit), so it's fairly easily fixed by just
moving the thing to builtin-add.c, and avoiding the dependency entirely.
I initially argued to Exal that it would probably be best to try to depend
on smart compilers and linkers, but after spending some time trying to
make -ffunction-sections work and giving up, I think Exal was right, and
the fix is to just do some trivial cleanups like this.
This trivial cleanup results in pretty stunning file size differences.
The diff machinery really is mostly used by just the builtin programs, and
you have things like these trivial before-and-after numbers:
-rwxr-xr-x 1 torvalds torvalds 1727420 2010-01-21 10:53 git-hash-object
-rwxrwxr-x 1 torvalds torvalds 940265 2010-01-21 11:16 git-hash-object
Now, I'm not saying that 940kB is good either, but that's mostly all the
debug information - you can see the real code with 'size':
text data bss dec hex filename
418675 3920 127408 550003 86473 git-hash-object (before)
230650 2288 111728 344666 5425a git-hash-object (after)
ie we have a nice 24% size reduction from this trivial cleanup.
It's not just that one file either. I get:
[torvalds@nehalem git]$ du -s /home/torvalds/libexec/git-core
45640 /home/torvalds/libexec/git-core (before)
33508 /home/torvalds/libexec/git-core (after)
so we're talking 12MB of diskspace here.
(Of course, stripping all the binaries brings the 33MB down to 9MB, so the
whole debug information thing is still the bulk of it all, but that's a
separate issue entirely)
Now, I'm sure there are other things we should do, and changing our
compiler flags from -O2 to -Os would bring the text size down by an
additional almost 20%, but this thing Exal pointed out seems to be some
good low-hanging fruit.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/cache-unmerge:
rerere forget path: forget recorded resolution
rerere: refactor rerere logic to make it independent from I/O
rerere: remove silly 1024-byte line limit
resolve-undo: teach "update-index --unresolve" to use resolve-undo info
resolve-undo: "checkout -m path" uses resolve-undo information
resolve-undo: allow plumbing to clear the information
resolve-undo: basic tests
resolve-undo: record resolved conflicts in a new index extension section
builtin-merge.c: use standard active_cache macros
Conflicts:
builtin-ls-files.c
builtin-merge.c
builtin-rerere.c
* jc/symbol-static:
date.c: mark file-local function static
Replace parse_blob() with an explanatory comment
symlinks.c: remove unused functions
object.c: remove unused functions
strbuf.c: remove unused function
sha1_file.c: remove unused function
mailmap.c: remove unused function
utf8.c: mark file-local function static
submodule.c: mark file-local function static
quote.c: mark file-local function static
remote-curl.c: mark file-local function static
read-cache.c: mark file-local functions static
parse-options.c: mark file-local function static
entry.c: mark file-local function static
http.c: mark file-local functions static
pretty.c: mark file-local function static
builtin-rev-list.c: mark file-local function static
bisect.c: mark file-local function static
* cc/reset-more:
t7111: check that reset options work as described in the tables
Documentation: reset: add some missing tables
Fix bit assignment for CE_CONFLICTED
"reset --merge": fix unmerged case
reset: use "unpack_trees()" directly instead of "git read-tree"
reset: add a few tests for "git reset --merge"
Documentation: reset: add some tables to describe the different options
reset: improve mixed reset error message when in a bare repo
* nd/sparse: (25 commits)
t7002: test for not using external grep on skip-worktree paths
t7002: set test prerequisite "external-grep" if supported
grep: do not do external grep on skip-worktree entries
commit: correctly respect skip-worktree bit
ie_match_stat(): do not ignore skip-worktree bit with CE_MATCH_IGNORE_VALID
tests: rename duplicate t1009
sparse checkout: inhibit empty worktree
Add tests for sparse checkout
read-tree: add --no-sparse-checkout to disable sparse checkout support
unpack-trees(): ignore worktree check outside checkout area
unpack_trees(): apply $GIT_DIR/info/sparse-checkout to the final index
unpack-trees(): "enable" sparse checkout and load $GIT_DIR/info/sparse-checkout
unpack-trees.c: generalize verify_* functions
unpack-trees(): add CE_WT_REMOVE to remove on worktree alone
Introduce "sparse checkout"
dir.c: export excluded_1() and add_excludes_from_file_1()
excluded_1(): support exclude files in index
unpack-trees(): carry skip-worktree bit over in merged_entry()
Read .gitignore from index if it is skip-worktree
Avoid writing to buffer in add_excludes_from_file_1()
...
Conflicts:
.gitignore
Documentation/config.txt
Documentation/git-update-index.txt
Makefile
entry.c
t/t7002-grep.sh
Commit 9e8ecea (Add 'merge' mode to 'git reset', 2008-12-01) disallowed
"git reset --merge" when there was unmerged entries. But it wished if
unmerged entries were reset as if --hard (instead of --merge) has been
used. This makes sense because all "mergy" operations makes sure that
any path involved in the merge does not have local modifications before
starting, so resetting such a path away won't lose any information.
The previous commit changed the behavior of --merge to accept resetting
unmerged entries if they are reset to a different state than HEAD, but it
did not reset the changes in the work tree, leaving the conflict markers
in the resulting file in the work tree.
Fix it by doing three things:
- Update the documentation to match the wish of original "reset --merge"
better, namely, "An unmerged entry is a sign that the path didn't have
any local modification and can be safely resetted to whatever the new
HEAD records";
- Update read_index_unmerged(), which reads the index file into the cache
while dropping any higher-stage entries down to stage #0, not to copy
the object name from the higher stage entry. The code used to take the
object name from the a stage entry ("base" if you happened to have
stage #1, or "ours" if both sides added, etc.), which essentially meant
that you are getting random results depending on what the merge did.
The _only_ reason we want to keep a previously unmerged entry in the
index at stage #0 is so that we don't forget the fact that we have
corresponding file in the work tree in order to be able to remove it
when the tree we are resetting to does not have the path. In order to
differentiate such an entry from ordinary cache entry, the cache entry
added by read_index_unmerged() is marked as CE_CONFLICTED.
- Update merged_entry() and deleted_entry() so that they pay attention to
cache entries marked as CE_CONFLICTED. They are previously unmerged
entries, and the files in the work tree that correspond to them are
resetted away by oneway_merge() to the version from the tree we are
resetting to.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On big endian platforms with 8-byte unsigned long, the code reads the
size of the index extension section (which is a 4-byte network byte
order integer) incorrectly.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When resolving a conflict using "git add" to create a stage #0 entry, or
"git rm" to remove entries at higher stages, remove_index_entry_at()
function is eventually called to remove unmerged (i.e. higher stage)
entries from the index. Introduce a "resolve_undo_info" structure and
keep track of the removed cache entries, and save it in a new index
extension section in the index_state.
Operations like "read-tree -m", "merge", "checkout [-m] <branch>" and
"reset" are signs that recorded information in the index is no longer
necessary. The data is removed from the index extension when operations
start; they may leave conflicted entries in the index, and later user
actions like "git add" will record their conflicted states afresh.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Previously CE_MATCH_IGNORE_VALID flag is used by both valid and
skip-worktree bits. While the two bits have similar behaviour, sharing
this flag means "git update-index --really-refresh" will ignore
skip-worktree while it should not. Instead another flag is
introduced to ignore skip-worktree bit, CE_MATCH_IGNORE_VALID only
applies to valid bit.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>