mirrors/git

mirror of https://github.com/git/git.git synced 2024-11-01 06:47:52 +01:00

320 lines

9 KiB

C

read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`#include "cache.h"`
			`#include "split-index.h"`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`#include "ewah/ewok.h"`
			`struct split_index *init_split_index(struct index_state *istate)`
			`{`
			`if (!istate->split_index) {`
			`istate->split_index = xcalloc(1, sizeof(*istate->split_index));`
			`istate->split_index->refcount = 1;`
			`}`
			`return istate->split_index;`
			`}`

			`int read_link_extension(struct index_state *istate,`
			`const void *data_, unsigned long sz)`
			`{`
			`const unsigned char *data = data_;`
			`struct split_index *si;`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`int ret;`

			`if (sz < 20)`
			`return error("corrupt link extension (too short)");`
			`si = init_split_index(istate);`
			`hashcpy(si->base_sha1, data);`
			`data += 20;`
			`sz -= 20;`
			`if (!sz)`
			`return 0;`
			`si->delete_bitmap = ewah_new();`
			`ret = ewah_read_mmap(si->delete_bitmap, data, sz);`
			`if (ret < 0)`
			`return error("corrupt delete bitmap in link extension");`
			`data += ret;`
			`sz -= ret;`
			`si->replace_bitmap = ewah_new();`
			`ret = ewah_read_mmap(si->replace_bitmap, data, sz);`
			`if (ret < 0)`
			`return error("corrupt replace bitmap in link extension");`
			`if (ret != sz)`
			`return error("garbage at the end of link extension");`
			`return 0;`
			`}`
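
The reader above treats the "link" extension payload as a 20-byte base hash, optionally followed by two serialized EWAH bitmaps (delete first, then replace). A minimal sketch of just the fixed-size part of that layout follows. The helper name `split_link_payload` and the `BASE_HASH_LEN` macro are hypothetical and not part of git, and the bitmap bytes are returned unparsed rather than decoded.

```c
#include <stddef.h>
#include <string.h>

#define BASE_HASH_LEN 20 /* SHA-1 size, hard-coded as 20 in the reader above */

/*
 * Hypothetical helper (not part of git): split a "link" extension
 * payload into the base hash and the remaining bitmap bytes.
 * Returns 0 on success, -1 if the payload is shorter than the hash.
 * An empty remainder (*bitmaps_sz == 0) means "no bitmaps stored".
 */
static int split_link_payload(const unsigned char *data, size_t sz,
			      unsigned char base_hash[BASE_HASH_LEN],
			      const unsigned char **bitmaps,
			      size_t *bitmaps_sz)
{
	if (sz < BASE_HASH_LEN)
		return -1;
	memcpy(base_hash, data, BASE_HASH_LEN);
	*bitmaps = data + BASE_HASH_LEN;
	*bitmaps_sz = sz - BASE_HASH_LEN;
	return 0;
}
```

A caller would hand the remaining bytes to a bitmap decoder twice, advancing by the number of bytes each decode consumed, which is what the two `ewah_read_mmap()` calls do in the real function.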

			`int write_link_extension(struct strbuf *sb,`
			`struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
			`strbuf_add(sb, si->base_sha1, 20);`
			`if (!si->delete_bitmap && !si->replace_bitmap)`
			`return 0;`
ewah: add convenient wrapper ewah_serialize_strbuf() Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2015-03-08 11:12:32 +01:00			`ewah_serialize_strbuf(si->delete_bitmap, sb);`
			`ewah_serialize_strbuf(si->replace_bitmap, sb);`
			`return 0;`
			`}`

			`static void mark_base_index_entries(struct index_state *base)`
			`{`
			`int i;`
			`/*`
			`* To keep track of the shared entries between`
			`* istate->base->cache[] and istate->cache[], base entry`
			`* position is stored in each base entry. All positions start`
typofix: assorted typofixes in comments, documentation and messages Many instances of duplicate words (e.g. "the the path") and a few typoes are fixed, originally in multiple patches. wildmatch: fix duplicate words of "the" t: fix duplicate words of "output" transport-helper: fix duplicate words of "read" Git.pm: fix duplicate words of "return" path: fix duplicate words of "look" pack-protocol.txt: fix duplicate words of "the" precompose-utf8: fix typo of "sequences" split-index: fix typo worktree.c: fix typo remote-ext: fix typo utf8: fix duplicate words of "the" git-cvsserver: fix duplicate words Signed-off-by: Li Peng <lip@dtdream.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2016-05-06 14:36:46 +02:00			`* from 1 instead of 0, which is reserved to say "this is a new`
			`* entry".`
			`*/`
			`for (i = 0; i < base->cache_nr; i++)`
			`base->cache[i]->index = i + 1;`
			`}`
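
The comment above describes a 1-based position convention: `index == 0` means "new entry", and any other value `N` points at slot `N - 1` of the base index. A small sketch of that convention on a toy structure follows. `struct toy_entry`, `toy_mark_base` and `toy_base_slot` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Toy stand-in for struct cache_entry; only the position field is kept. */
struct toy_entry {
	unsigned int index; /* 0 = new entry, N = shared with base slot N - 1 */
};

/* Stamp every base entry with its 1-based position. */
static void toy_mark_base(struct toy_entry **base, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		base[i]->index = (unsigned int)i + 1;
}

/* Map an entry back to its base slot, or -1 if it is not shared. */
static int toy_base_slot(const struct toy_entry *e)
{
	return e->index ? (int)e->index - 1 : -1;
}
```

Reserving zero for "new" lets a zero-initialized entry default to "not shared with the base", so no separate flag is needed.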

update-index: new options to enable/disable split index mode If you have a large work tree but only make changes in a subset, then $GIT_DIR/index's size should be stable after a while. If you change branches that touch something else, $GIT_DIR/index's size may grow large that it becomes as slow as the unified index. Do --split-index again occasionally to force all changes back to the shared index and keep $GIT_DIR/index small. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:44 +02:00			`void move_cache_to_base_index(struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
			`int i;`

			`/*`
			`* do not delete old si->base, its index entries may be shared`
			`* with istate->cache[]. Accept a bit of leaking here because`
			`* this code is only used by short-lived update-index.`
			`*/`
			`si->base = xcalloc(1, sizeof(*si->base));`
			`si->base->version = istate->version;`
			`/* zero timestamp disables racy test in ce_write_index() */`
			`si->base->timestamp = istate->timestamp;`
			`ALLOC_GROW(si->base->cache, istate->cache_nr, si->base->cache_alloc);`
			`si->base->cache_nr = istate->cache_nr;`
use COPY_ARRAY Add a semantic patch for converting certain calls of memcpy(3) to COPY_ARRAY() and apply that transformation to the code base. The result is shorter and safer code. For now only consider calls where source and destination have the same type, or in other words: easy cases. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2016-09-25 09:24:03 +02:00			`COPY_ARRAY(si->base->cache, istate->cache, istate->cache_nr);`
			`mark_base_index_entries(si->base);`
			`for (i = 0; i < si->base->cache_nr; i++)`
			`si->base->cache[i]->ce_flags &= ~CE_UPDATE_IN_BASE;`
			`}`

			`static void mark_entry_for_delete(size_t pos, void *data)`
			`{`
			`struct index_state *istate = data;`
			`if (pos >= istate->cache_nr)`
			`die("position for delete %d exceeds base index size %d",`
			`(int)pos, istate->cache_nr);`
			`istate->cache[pos]->ce_flags |= CE_REMOVE;`
			`istate->split_index->nr_deletions = 1;`
			`}`

			`static void replace_entry(size_t pos, void *data)`
			`{`
			`struct index_state *istate = data;`
			`struct split_index *si = istate->split_index;`
			`struct cache_entry *dst, *src;`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00
			`if (pos >= istate->cache_nr)`
			`die("position for replacement %d exceeds base index size %d",`
			`(int)pos, istate->cache_nr);`
			`if (si->nr_replacements >= si->saved_cache_nr)`
			`die("too many replacements (%d vs %d)",`
			`si->nr_replacements, si->saved_cache_nr);`
			`dst = istate->cache[pos];`
			`if (dst->ce_flags & CE_REMOVE)`
			`die("entry %d is marked as both replaced and deleted",`
			`(int)pos);`
			`src = si->saved_cache[si->nr_replacements];`
			`if (ce_namelen(src))`
			`die("corrupt link extension, entry %d should have "`
			`"zero length name", (int)pos);`
			`src->index = pos + 1;`
			`src->ce_flags |= CE_UPDATE_IN_BASE;`
			`src->ce_namelen = dst->ce_namelen;`
			`copy_cache_entry(dst, src);`
			`free(src);`
			`si->nr_replacements++;`
			`}`

			`void merge_base_index(struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
			`unsigned int i;`

			`mark_base_index_entries(si->base);`
			`si->saved_cache = istate->cache;`
			`si->saved_cache_nr = istate->cache_nr;`
			`istate->cache_nr = si->base->cache_nr;`
			`istate->cache = NULL;`
			`istate->cache_alloc = 0;`
			`ALLOC_GROW(istate->cache, istate->cache_nr, istate->cache_alloc);`
			`COPY_ARRAY(istate->cache, si->base->cache, istate->cache_nr);`

			`si->nr_deletions = 0;`
			`si->nr_replacements = 0;`
			`ewah_each_bit(si->replace_bitmap, replace_entry, istate);`
			`ewah_each_bit(si->delete_bitmap, mark_entry_for_delete, istate);`
			`if (si->nr_deletions)`
			`remove_marked_cache_entries(istate);`

			`for (i = si->nr_replacements; i < si->saved_cache_nr; i++) {`
			`if (!ce_namelen(si->saved_cache[i]))`
			`die("corrupt link extension, entry %d should "`
			`"have non-zero length name", i);`
			`add_index_entry(istate, si->saved_cache[i],`
			`ADD_CACHE_OK_TO_ADD |`
split-index: do not invalidate cache-tree at read time We are sure that after merge_base_index() is done. cache-tree can still be used with the final index. So don't destroy cache tree. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:42 +02:00			`ADD_CACHE_KEEP_CACHE_TREE |`
			`/*`
			`* we may have to replay what`
			`* merge-recursive.c:update_stages()`
			`* does, which has this flag on`
			`*/`
			`ADD_CACHE_SKIP_DFCHECK);`
			`si->saved_cache[i] = NULL;`
			`}`

			`ewah_free(si->delete_bitmap);`
			`ewah_free(si->replace_bitmap);`
			`free(si->saved_cache);`
			`si->delete_bitmap = NULL;`
			`si->replace_bitmap = NULL;`
			`si->saved_cache = NULL;`
			`si->saved_cache_nr = 0;`
			`}`
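
`merge_base_index()` rebuilds the full index from three inputs: the base entries, the replace and delete bitmaps, and the entries stored in `$GIT_DIR/index`, where replacements come first and additions follow. The sketch below replays that idea on plain integers. It is a simplification under stated assumptions, not git's implementation: `toy_merge` is a hypothetical name, 32-bit masks stand in for the EWAH bitmaps, the base is limited to 32 slots, and additions are appended at the end rather than inserted in sorted order as `add_index_entry()` does.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy merge (hypothetical, for illustration only).
 * base:  nr_base values from the shared index (nr_base <= 32).
 * saved: replacement values first, one per set bit in replace_bits in
 *        ascending bit order, followed by newly added values.
 * out:   must have room for nr_base + nr_saved values.
 * Returns the merged length, or -1 on inconsistent input.
 */
static int toy_merge(const int *base, size_t nr_base,
		     uint32_t delete_bits, uint32_t replace_bits,
		     const int *saved, size_t nr_saved, int *out)
{
	size_t used = 0, n = 0;

	if (nr_base > 32)
		return -1;
	for (size_t i = 0; i < nr_base; i++) {
		uint32_t bit = UINT32_C(1) << i;

		if (replace_bits & bit) {
			if (delete_bits & bit)
				return -1; /* both replaced and deleted */
			if (used >= nr_saved)
				return -1; /* too many replacements */
			out[n++] = saved[used++];
		} else if (!(delete_bits & bit)) {
			out[n++] = base[i]; /* unchanged, taken from base */
		}
	}
	/* Whatever is left in saved[] is a new entry. */
	while (used < nr_saved)
		out[n++] = saved[used++];
	return (int)n;
}
```

With a base of `{10, 20, 30, 40}`, a delete mask of `0x2`, a replace mask of `0x4` and saved values `{31, 50}`, the merged result is `{10, 31, 40, 50}`: slot 1 is dropped, slot 2 is replaced, and `50` is appended as an addition.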

			`void prepare_to_write_split_index(struct index_state *istate)`
			`{`
			`struct split_index *si = init_split_index(istate);`
			`struct cache_entry **entries = NULL, *ce;`
			`int i, nr_entries = 0, nr_alloc = 0;`

			`si->delete_bitmap = ewah_new();`
			`si->replace_bitmap = ewah_new();`

			`if (si->base) {`
			`/* Go through istate->cache[] and mark CE_MATCHED to`
			`* entry with positive index. We'll go through`
			`* base->cache[] later to delete all entries in base`
split-index: s/eith/with/ typo fix Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2016-10-23 11:26:30 +02:00			`* that are not marked with either CE_MATCHED or`
			`* CE_UPDATE_IN_BASE. If istate->cache[i] is a`
			`* duplicate, deduplicate it.`
			`*/`
			`for (i = 0; i < istate->cache_nr; i++) {`
			`struct cache_entry *base;`
			`/* namelen is checked separately */`
			`const unsigned int ondisk_flags =`
			`CE_STAGEMASK | CE_VALID | CE_EXTENDED_FLAGS;`
			`unsigned int ce_flags, base_flags, ret;`
			`ce = istate->cache[i];`
			`if (!ce->index)`
			`continue;`
			`if (ce->index > si->base->cache_nr) {`
			`ce->index = 0;`
			`continue;`
			`}`
			`ce->ce_flags |= CE_MATCHED; /* or "shared" */`
			`base = si->base->cache[ce->index - 1];`
			`if (ce == base)`
			`continue;`
			`if (ce->ce_namelen != base->ce_namelen ||`
			`strcmp(ce->name, base->name)) {`
			`ce->index = 0;`
			`continue;`
			`}`
			`ce_flags = ce->ce_flags;`
			`base_flags = base->ce_flags;`
			`/* only on-disk flags matter */`
			`ce->ce_flags &= ondisk_flags;`
			`base->ce_flags &= ondisk_flags;`
			`ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,`
			`offsetof(struct cache_entry, name) -`
			`offsetof(struct cache_entry, ce_stat_data));`
			`ce->ce_flags = ce_flags;`
			`base->ce_flags = base_flags;`
			`if (ret)`
			`ce->ce_flags |= CE_UPDATE_IN_BASE;`
			`free(base);`
			`si->base->cache[ce->index - 1] = ce;`
			`}`
			`for (i = 0; i < si->base->cache_nr; i++) {`
			`ce = si->base->cache[i];`
			`if ((ce->ce_flags & CE_REMOVE) ||`
			`!(ce->ce_flags & CE_MATCHED))`
			`ewah_set(si->delete_bitmap, i);`
			`else if (ce->ce_flags & CE_UPDATE_IN_BASE) {`
			`ewah_set(si->replace_bitmap, i);`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00			`ce->ce_flags |= CE_STRIP_NAME;`
			`ALLOC_GROW(entries, nr_entries+1, nr_alloc);`
			`entries[nr_entries++] = ce;`
			`}`
			`}`
			`}`

			`for (i = 0; i < istate->cache_nr; i++) {`
			`ce = istate->cache[i];`
			`if ((!si->base || !ce->index) && !(ce->ce_flags & CE_REMOVE)) {`
			`assert(!(ce->ce_flags & CE_STRIP_NAME));`
			`ALLOC_GROW(entries, nr_entries+1, nr_alloc);`
			`entries[nr_entries++] = ce;`
			`}`
			`ce->ce_flags &= ~CE_MATCHED;`
			`}`

			`/*`
			`* take cache[] out temporarily, put entries[] in its place`
			`* for writing`
			`*/`
			`si->saved_cache = istate->cache;`
			`si->saved_cache_nr = istate->cache_nr;`
			`istate->cache = entries;`
			`istate->cache_nr = nr_entries;`
			`}`
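
The second loop of `prepare_to_write_split_index()` sorts every base entry into one of three outcomes: deleted, replaced, or untouched. The following is a minimal sketch of that decision only, under stated assumptions: the `SK_*` flag values, `sk_classify()` and `sk_classify_base()` are hypothetical names invented for illustration, and plain byte arrays stand in for the EWAH bitmaps used by the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for illustration; these are not git's constants. */
enum {
	SK_MATCHED        = 1 << 0, /* still referenced by the live index */
	SK_UPDATE_IN_BASE = 1 << 1, /* differs from the shared copy */
	SK_REMOVE         = 1 << 2  /* removed from the live index */
};

enum sk_action { SK_KEEP, SK_DELETE, SK_REPLACE };

/* Same order of checks as the loop above: deletion wins over replacement. */
static enum sk_action sk_classify(unsigned int flags)
{
	if ((flags & SK_REMOVE) || !(flags & SK_MATCHED))
		return SK_DELETE;
	if (flags & SK_UPDATE_IN_BASE)
		return SK_REPLACE;
	return SK_KEEP;
}

/*
 * Fill delete_bits[] and replace_bits[] (one byte per base position)
 * and return how many entries were classified as replaced.
 */
static size_t sk_classify_base(const unsigned int *flags, size_t nr,
			       unsigned char *delete_bits,
			       unsigned char *replace_bits)
{
	size_t i, nr_replaced = 0;

	assert(flags && delete_bits && replace_bits);
	for (i = 0; i < nr; i++) {
		enum sk_action action = sk_classify(flags[i]);

		delete_bits[i] = (action == SK_DELETE);
		replace_bits[i] = (action == SK_REPLACE);
		if (action == SK_REPLACE)
			nr_replaced++;
	}
	return nr_replaced;
}
```

An entry that is both removed and marked as updated ends up in the delete set only, which is consistent with the commit message's note that delete-then-add is recorded as a deletion plus an addition rather than a replacement.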

			`void finish_writing_split_index(struct index_state *istate)`
			`{`
			`struct split_index *si = init_split_index(istate);`
			`ewah_free(si->delete_bitmap);`
			`ewah_free(si->replace_bitmap);`
			`si->delete_bitmap = NULL;`
			`si->replace_bitmap = NULL;`
			`free(istate->cache);`
			`istate->cache = si->saved_cache;`
			`istate->cache_nr = si->saved_cache_nr;`
			`}`
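
`prepare_to_write_split_index()` and `finish_writing_split_index()` form a pair: the first parks the full `cache[]` and installs a temporary array for writing, the second frees the temporary array and puts the original back. A minimal sketch of that swap-and-restore pattern follows. The `sk_state` struct and both function names are hypothetical, the entries are plain `int`s, and clearing `saved_cache` after the restore is an addition of this sketch, not something the code above does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the relevant fields of index_state and split_index. */
struct sk_state {
	int *cache;            /* array currently visible to the writer */
	size_t cache_nr;
	int *saved_cache;      /* full array parked during the write */
	size_t saved_cache_nr;
};

/* Park the full array and expose only the heap-allocated subset. */
static void sk_prepare(struct sk_state *st, int *subset, size_t subset_nr)
{
	assert(!st->saved_cache); /* no nested prepare calls in this sketch */
	st->saved_cache = st->cache;
	st->saved_cache_nr = st->cache_nr;
	st->cache = subset;
	st->cache_nr = subset_nr;
}

/* Free the temporary subset and restore the parked array. */
static void sk_finish(struct sk_state *st)
{
	free(st->cache);
	st->cache = st->saved_cache;
	st->cache_nr = st->saved_cache_nr;
	st->saved_cache = NULL; /* sketch-only tidy-up */
	st->saved_cache_nr = 0;
}
```

Only the temporary array is freed in `sk_finish()`; the entries it pointed at are still owned by the parked array, which is why the real code calls `free(istate->cache)` without freeing individual entries.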

			`void discard_split_index(struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
			`if (!si)`
			`return;`
			`istate->split_index = NULL;`
			`si->refcount--;`
			`if (si->refcount)`
			`return;`
			`if (si->base) {`
			`discard_index(si->base);`
			`free(si->base);`
			`}`
			`free(si);`
			`}`
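
`discard_split_index()` detaches the split index from its owner first and only frees it once the reference count reaches zero. The sketch below isolates that reference-counting step. The `sk_shared` and `sk_owner` types and `sk_release()` are hypothetical names, and the return value is an addition made here so the behavior can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared object with a reference count. */
struct sk_shared {
	int refcount;
};

/* Hypothetical owner that holds one reference to the shared object. */
struct sk_owner {
	struct sk_shared *shared;
};

/*
 * Drop the owner's reference. Returns 1 if this was the last reference
 * and the shared object was freed, 0 otherwise.
 */
static int sk_release(struct sk_owner *owner)
{
	struct sk_shared *shared = owner->shared;

	if (!shared)
		return 0;             /* nothing attached, nothing to do */
	owner->shared = NULL;         /* detach before touching the count */
	shared->refcount--;
	if (shared->refcount)
		return 0;             /* other owners still hold it */
	free(shared);
	return 1;
}
```

Detaching before decrementing makes a second call on the same owner a harmless no-op, so one owner cannot drop the count twice.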
read-cache: save deleted entries in split index Entries that belong to the base index should not be freed. Mark CE_REMOVE to track them. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:38 +02:00
			`void save_or_free_index_entry(struct index_state *istate, struct cache_entry *ce)`
			`{`
			`if (ce->index &&`
			`istate->split_index &&`
			`istate->split_index->base &&`
			`ce->index <= istate->split_index->base->cache_nr &&`
			`ce == istate->split_index->base->cache[ce->index - 1])`
			`ce->ce_flags |= CE_REMOVE;`
			`else`
			`free(ce);`
			`}`
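
The condition in `save_or_free_index_entry()` relies on `ce->index` being a 1-based position into the base `cache[]`, with 0 meaning "not in the base". A minimal sketch of that ownership test is shown below; `sk_entry` and `sk_owned_by_base()` are hypothetical names, and the entry carries only the `index` field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry: index is a 1-based slot in the base array, 0 if none. */
struct sk_entry {
	unsigned int index;
};

/*
 * An entry is owned by the base only if its index is non-zero, in range,
 * and the slot it names actually points back at this very entry.
 */
static int sk_owned_by_base(struct sk_entry *const *base, size_t base_nr,
			    const struct sk_entry *entry)
{
	assert(base && entry);
	return entry->index &&
	       entry->index <= base_nr &&
	       base[entry->index - 1] == entry;
}
```

The pointer comparison matters: a copy of an entry can carry the same `index` as the original, and only the object actually stored in the base slot must be kept alive and flagged instead of freed.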
read-cache: mark updated entries for split index The large part of this patch just follows CE_ENTRY_CHANGED marks. replace_index_entry() is updated to update split_index->base->cache[] as well so base->cache[] does not reference to a freed entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:39 +02:00
			`void replace_index_entry_in_base(struct index_state *istate,`
			`struct cache_entry *old,`
			`struct cache_entry *new)`
			`{`
			`if (old->index &&`
			`istate->split_index &&`
			`istate->split_index->base &&`
			`old->index <= istate->split_index->base->cache_nr) {`
			`new->index = old->index;`
			`if (old != istate->split_index->base->cache[new->index - 1])`
			`free(istate->split_index->base->cache[new->index - 1]);`
			`istate->split_index->base->cache[new->index - 1] = new;`
			`}`
			`}`