mirrors/git - Incest Forge: Beyond sex. We incest.

mirrors/git

mirror of https://github.com/git/git.git synced 2024-11-01 23:07:55 +01:00

377 lines

11 KiB

C

Raw Normal View History

read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`#include "cache.h"`
			`#include "split-index.h"`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`#include "ewah/ewok.h"`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00
			`struct split_index init_split_index(struct index_state istate)`
			`{`
			`if (!istate->split_index) {`
			`istate->split_index = xcalloc(1, sizeof(*istate->split_index));`
			`istate->split_index->refcount = 1;`
			`}`
			`return istate->split_index;`
			`}`

			`int read_link_extension(struct index_state *istate,`
			`const void *data_, unsigned long sz)`
			`{`
			`const unsigned char *data = data_;`
			`struct split_index *si;`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`int ret;`

split-index: convert struct split_index to object_id Convert the base_sha1 member of struct split_index to use struct object_id and rename it base_oid. Include cache.h to make the structure visible. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-05-02 02:25:43 +02:00			`if (sz < the_hash_algo->rawsz)`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`return error("corrupt link extension (too short)");`
			`si = init_split_index(istate);`
split-index: convert struct split_index to object_id Convert the base_sha1 member of struct split_index to use struct object_id and rename it base_oid. Include cache.h to make the structure visible. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-05-02 02:25:43 +02:00			`hashcpy(si->base_oid.hash, data);`
			`data += the_hash_algo->rawsz;`
			`sz -= the_hash_algo->rawsz;`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`if (!sz)`
			`return 0;`
			`si->delete_bitmap = ewah_new();`
			`ret = ewah_read_mmap(si->delete_bitmap, data, sz);`
			`if (ret < 0)`
			`return error("corrupt delete bitmap in link extension");`
			`data += ret;`
			`sz -= ret;`
			`si->replace_bitmap = ewah_new();`
			`ret = ewah_read_mmap(si->replace_bitmap, data, sz);`
			`if (ret < 0)`
			`return error("corrupt replace bitmap in link extension");`
			`if (ret != sz)`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`return error("garbage at the end of link extension");`
			`return 0;`
			`}`

			`int write_link_extension(struct strbuf *sb,`
			`struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
split-index: convert struct split_index to object_id Convert the base_sha1 member of struct split_index to use struct object_id and rename it base_oid. Include cache.h to make the structure visible. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-05-02 02:25:43 +02:00			`strbuf_add(sb, si->base_oid.hash, the_hash_algo->rawsz);`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`if (!si->delete_bitmap && !si->replace_bitmap)`
			`return 0;`
ewah: add convenient wrapper ewah_serialize_strbuf() Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2015-03-08 11:12:32 +01:00			`ewah_serialize_strbuf(si->delete_bitmap, sb);`
			`ewah_serialize_strbuf(si->replace_bitmap, sb);`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`return 0;`
			`}`

			`static void mark_base_index_entries(struct index_state *base)`
			`{`
			`int i;`
			`/*`
			`* To keep track of the shared entries between`
			`* istate->base->cache[] and istate->cache[], base entry`
			`* position is stored in each base entry. All positions start`
typofix: assorted typofixes in comments, documentation and messages Many instances of duplicate words (e.g. "the the path") and a few typoes are fixed, originally in multiple patches. wildmatch: fix duplicate words of "the" t: fix duplicate words of "output" transport-helper: fix duplicate words of "read" Git.pm: fix duplicate words of "return" path: fix duplicate words of "look" pack-protocol.txt: fix duplicate words of "the" precompose-utf8: fix typo of "sequences" split-index: fix typo worktree.c: fix typo remote-ext: fix typo utf8: fix duplicate words of "the" git-cvsserver: fix duplicate words Signed-off-by: Li Peng <lip@dtdream.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2016-05-06 14:36:46 +02:00			`* from 1 instead of 0, which is reserved to say "this is a new`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`* entry".`
			`*/`
			`for (i = 0; i < base->cache_nr; i++)`
			`base->cache[i]->index = i + 1;`
			`}`

update-index: new options to enable/disable split index mode If you have a large work tree but only make changes in a subset, then $GIT_DIR/index's size should be stable after a while. If you change branches that touch something else, $GIT_DIR/index's size may grow large that it becomes as slow as the unified index. Do --split-index again occasionally to force all changes back to the shared index and keep $GIT_DIR/index small. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:44 +02:00			`void move_cache_to_base_index(struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
			`int i;`

			`/*`
block alloc: allocate cache entries from mem_pool When reading large indexes from disk, a portion of the time is dominated in malloc() calls. This can be mitigated by allocating a large block of memory and manage it ourselves via memory pools. This change moves the cache entry allocation to be on top of memory pools. Design: The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated from. When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be. When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for. In the case of a Split Index, the following rules are followed. 1st, some terminology is defined: Terminology: - 'the_index': represents the logical view of the index - 'split_index': represents the "base" cache entries. Read from the split index file. 'the_index' can reference a single split_index, as well as cache_entries from the split_index. `the_index` will be discarded before the `split_index` is. This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the `split_index`'s memory pool. This allows us to follow the pattern that `the_index` can reference cache_entries from the `split_index`, and that the cache_entries will not be freed while they are still being referenced. Managing transient cache_entry structs: Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with. Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently. Several strategies were contemplated around how to handle this: Chosen approach: An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not. This is currently an int field, as there are no more available bits in the existing ce_flags bit field. If / when more bits are needed, this new field can be turned into a proper bit field. Alternatives: 1) Do not include any information about how the cache_entry was allocated. Calling code would be responsible for tracking whether the cache_entry needed to be freed or not. Pro: No extra memory overhead to track this state Con: Extra complexity in callers to handle this correctly. The extra complexity and burden to not regress this behavior in the future was more than we wanted. 2) cache_entry would gain knowledge about which mem_pool allocated it Pro: Could (potentially) do extra logic to know when a mem_pool no longer had references to any cache_entry Con: cache_entry would grow heavier by a pointer, instead of int We didn't see a tangible benefit to this approach 3) Do not add any extra information to a cache_entry, but when freeing a cache entry, check if the memory exists in a region managed by existing mem_pools. Pro: No extra memory overhead to track state Con: Extra computation is performed when freeing cache entries We decided tracking and iterating over known memory pool regions was less desirable than adding an extra field to track this stae. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:37 +02:00			`* If there was a previous base index, then transfer ownership of allocated`
			`* entries to the parent index.`
update-index: new options to enable/disable split index mode If you have a large work tree but only make changes in a subset, then $GIT_DIR/index's size should be stable after a while. If you change branches that touch something else, $GIT_DIR/index's size may grow large that it becomes as slow as the unified index. Do --split-index again occasionally to force all changes back to the shared index and keep $GIT_DIR/index small. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:44 +02:00			`*/`
block alloc: allocate cache entries from mem_pool When reading large indexes from disk, a portion of the time is dominated in malloc() calls. This can be mitigated by allocating a large block of memory and manage it ourselves via memory pools. This change moves the cache entry allocation to be on top of memory pools. Design: The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated from. When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be. When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for. In the case of a Split Index, the following rules are followed. 1st, some terminology is defined: Terminology: - 'the_index': represents the logical view of the index - 'split_index': represents the "base" cache entries. Read from the split index file. 'the_index' can reference a single split_index, as well as cache_entries from the split_index. `the_index` will be discarded before the `split_index` is. This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the `split_index`'s memory pool. This allows us to follow the pattern that `the_index` can reference cache_entries from the `split_index`, and that the cache_entries will not be freed while they are still being referenced. Managing transient cache_entry structs: Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with. Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently. Several strategies were contemplated around how to handle this: Chosen approach: An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not. This is currently an int field, as there are no more available bits in the existing ce_flags bit field. If / when more bits are needed, this new field can be turned into a proper bit field. Alternatives: 1) Do not include any information about how the cache_entry was allocated. Calling code would be responsible for tracking whether the cache_entry needed to be freed or not. Pro: No extra memory overhead to track this state Con: Extra complexity in callers to handle this correctly. The extra complexity and burden to not regress this behavior in the future was more than we wanted. 2) cache_entry would gain knowledge about which mem_pool allocated it Pro: Could (potentially) do extra logic to know when a mem_pool no longer had references to any cache_entry Con: cache_entry would grow heavier by a pointer, instead of int We didn't see a tangible benefit to this approach 3) Do not add any extra information to a cache_entry, but when freeing a cache entry, check if the memory exists in a region managed by existing mem_pools. Pro: No extra memory overhead to track state Con: Extra computation is performed when freeing cache entries We decided tracking and iterating over known memory pool regions was less desirable than adding an extra field to track this stae. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:37 +02:00			`if (si->base &&`
			`si->base->ce_mem_pool) {`

			`if (!istate->ce_mem_pool)`
			`mem_pool_init(&istate->ce_mem_pool, 0);`

			`mem_pool_combine(istate->ce_mem_pool, istate->split_index->base->ce_mem_pool);`
			`}`

update-index: new options to enable/disable split index mode If you have a large work tree but only make changes in a subset, then $GIT_DIR/index's size should be stable after a while. If you change branches that touch something else, $GIT_DIR/index's size may grow large that it becomes as slow as the unified index. Do --split-index again occasionally to force all changes back to the shared index and keep $GIT_DIR/index small. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:44 +02:00			`si->base = xcalloc(1, sizeof(*si->base));`
			`si->base->version = istate->version;`
			`/* zero timestamp disables racy test in ce_write_index() */`
			`si->base->timestamp = istate->timestamp;`
			`ALLOC_GROW(si->base->cache, istate->cache_nr, si->base->cache_alloc);`
			`si->base->cache_nr = istate->cache_nr;`
block alloc: allocate cache entries from mem_pool When reading large indexes from disk, a portion of the time is dominated in malloc() calls. This can be mitigated by allocating a large block of memory and manage it ourselves via memory pools. This change moves the cache entry allocation to be on top of memory pools. Design: The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated from. When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be. When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for. In the case of a Split Index, the following rules are followed. 1st, some terminology is defined: Terminology: - 'the_index': represents the logical view of the index - 'split_index': represents the "base" cache entries. Read from the split index file. 'the_index' can reference a single split_index, as well as cache_entries from the split_index. `the_index` will be discarded before the `split_index` is. This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the `split_index`'s memory pool. This allows us to follow the pattern that `the_index` can reference cache_entries from the `split_index`, and that the cache_entries will not be freed while they are still being referenced. Managing transient cache_entry structs: Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with. Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently. Several strategies were contemplated around how to handle this: Chosen approach: An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not. This is currently an int field, as there are no more available bits in the existing ce_flags bit field. If / when more bits are needed, this new field can be turned into a proper bit field. Alternatives: 1) Do not include any information about how the cache_entry was allocated. Calling code would be responsible for tracking whether the cache_entry needed to be freed or not. Pro: No extra memory overhead to track this state Con: Extra complexity in callers to handle this correctly. The extra complexity and burden to not regress this behavior in the future was more than we wanted. 2) cache_entry would gain knowledge about which mem_pool allocated it Pro: Could (potentially) do extra logic to know when a mem_pool no longer had references to any cache_entry Con: cache_entry would grow heavier by a pointer, instead of int We didn't see a tangible benefit to this approach 3) Do not add any extra information to a cache_entry, but when freeing a cache entry, check if the memory exists in a region managed by existing mem_pools. Pro: No extra memory overhead to track state Con: Extra computation is performed when freeing cache entries We decided tracking and iterating over known memory pool regions was less desirable than adding an extra field to track this stae. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:37 +02:00
			`/*`
			`* The mem_pool needs to move with the allocated entries.`
			`*/`
			`si->base->ce_mem_pool = istate->ce_mem_pool;`
			`istate->ce_mem_pool = NULL;`

use COPY_ARRAY Add a semantic patch for converting certain calls of memcpy(3) to COPY_ARRAY() and apply that transformation to the code base. The result is shorter and safer code. For now only consider calls where source and destination have the same type, or in other words: easy cases. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2016-09-25 09:24:03 +02:00			`COPY_ARRAY(si->base->cache, istate->cache, istate->cache_nr);`
update-index: new options to enable/disable split index mode If you have a large work tree but only make changes in a subset, then $GIT_DIR/index's size should be stable after a while. If you change branches that touch something else, $GIT_DIR/index's size may grow large that it becomes as slow as the unified index. Do --split-index again occasionally to force all changes back to the shared index and keep $GIT_DIR/index small. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:44 +02:00			`mark_base_index_entries(si->base);`
			`for (i = 0; i < si->base->cache_nr; i++)`
			`si->base->cache[i]->ce_flags &= ~CE_UPDATE_IN_BASE;`
			`}`

split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`static void mark_entry_for_delete(size_t pos, void *data)`
			`{`
			`struct index_state *istate = data;`
			`if (pos >= istate->cache_nr)`
			`die("position for delete %d exceeds base index size %d",`
			`(int)pos, istate->cache_nr);`
			`istate->cache[pos]->ce_flags \|= CE_REMOVE;`
			`istate->split_index->nr_deletions = 1;`
			`}`

			`static void replace_entry(size_t pos, void *data)`
			`{`
			`struct index_state *istate = data;`
			`struct split_index *si = istate->split_index;`
			`struct cache_entry dst, src;`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`if (pos >= istate->cache_nr)`
			`die("position for replacement %d exceeds base index size %d",`
			`(int)pos, istate->cache_nr);`
			`if (si->nr_replacements >= si->saved_cache_nr)`
			`die("too many replacements (%d vs %d)",`
			`si->nr_replacements, si->saved_cache_nr);`
			`dst = istate->cache[pos];`
			`if (dst->ce_flags & CE_REMOVE)`
			`die("entry %d is marked as both replaced and deleted",`
			`(int)pos);`
			`src = si->saved_cache[si->nr_replacements];`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00			`if (ce_namelen(src))`
			`die("corrupt link extension, entry %d should have "`
			`"zero length name", (int)pos);`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`src->index = pos + 1;`
			`src->ce_flags \|= CE_UPDATE_IN_BASE;`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00			`src->ce_namelen = dst->ce_namelen;`
			`copy_cache_entry(dst, src);`
block alloc: add lifecycle APIs for cache_entry structs It has been observed that the time spent loading an index with a large number of entries is partly dominated by malloc() calls. This change is in preparation for using memory pools to reduce the number of malloc() calls made to allocate cahce entries when loading an index. Add an API to allocate and discard cache entries, abstracting the details of managing the memory backing the cache entries. This commit does actually change how memory is managed - this will be done in a later commit in the series. This change makes the distinction between cache entries that are associated with an index and cache entries that are not associated with an index. A main use of cache entries is with an index, and we can optimize the memory management around this. We still have other cases where a cache entry is not persisted with an index, and so we need to handle the "transient" use case as well. To keep the congnitive overhead of managing the cache entries, there will only be a single discard function. This means there must be enough information kept with the cache entry so that we know how to discard them. A summary of the main functions in the API is: make_cache_entry: create cache entry for use in an index. Uses specified parameters to populate cache_entry fields. make_empty_cache_entry: Create an empty cache entry for use in an index. Returns cache entry with empty fields. make_transient_cache_entry: create cache entry that is not used in an index. Uses specified parameters to populate cache_entry fields. make_empty_transient_cache_entry: create cache entry that is not used in an index. Returns cache entry with empty fields. discard_cache_entry: A single function that knows how to discard a cache entry regardless of how it was allocated. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:31 +02:00			`discard_cache_entry(src);`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`si->nr_replacements++;`
			`}`

read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`void merge_base_index(struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`unsigned int i;`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00
			`mark_base_index_entries(si->base);`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00
			`si->saved_cache = istate->cache;`
			`si->saved_cache_nr = istate->cache_nr;`
			`istate->cache_nr = si->base->cache_nr;`
			`istate->cache = NULL;`
			`istate->cache_alloc = 0;`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`ALLOC_GROW(istate->cache, istate->cache_nr, istate->cache_alloc);`
use COPY_ARRAY Add a semantic patch for converting certain calls of memcpy(3) to COPY_ARRAY() and apply that transformation to the code base. The result is shorter and safer code. For now only consider calls where source and destination have the same type, or in other words: easy cases. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2016-09-25 09:24:03 +02:00			`COPY_ARRAY(istate->cache, si->base->cache, istate->cache_nr);`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00
			`si->nr_deletions = 0;`
			`si->nr_replacements = 0;`
			`ewah_each_bit(si->replace_bitmap, replace_entry, istate);`
			`ewah_each_bit(si->delete_bitmap, mark_entry_for_delete, istate);`
			`if (si->nr_deletions)`
			`remove_marked_cache_entries(istate);`

			`for (i = si->nr_replacements; i < si->saved_cache_nr; i++) {`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00			`if (!ce_namelen(si->saved_cache[i]))`
			`die("corrupt link extension, entry %d should "`
			`"have non-zero length name", i);`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`add_index_entry(istate, si->saved_cache[i],`
			`ADD_CACHE_OK_TO_ADD \|`
split-index: do not invalidate cache-tree at read time We are sure that after merge_base_index() is done. cache-tree can still be used with the final index. So don't destroy cache tree. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:42 +02:00			`ADD_CACHE_KEEP_CACHE_TREE \|`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`/*`
			`* we may have to replay what`
			`* merge-recursive.c:update_stages()`
			`* does, which has this flag on`
			`*/`
			`ADD_CACHE_SKIP_DFCHECK);`
			`si->saved_cache[i] = NULL;`
			`}`

			`ewah_free(si->delete_bitmap);`
			`ewah_free(si->replace_bitmap);`
*.[ch] refactoring: make use of the FREE_AND_NULL() macro Replace occurrences of `free(ptr); ptr = NULL` which weren't caught by the coccinelle rule. These fall into two categories: - free/NULL assignments one after the other which coccinelle all put on one line, which is functionally equivalent code, but very ugly. - manually spotted occurrences where the NULL assignment isn't right after the free() call. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2017-06-16 01:15:49 +02:00			`FREE_AND_NULL(si->saved_cache);`
split-index: the reading part CE_REMOVE'd entries are removed here because only parts of the code base (unpack_trees in fact) test this bit when they look for the presence of an entry. Leaving them may confuse the code ignores this bit and expects to see a real entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:41 +02:00			`si->delete_bitmap = NULL;`
			`si->replace_bitmap = NULL;`
			`si->saved_cache_nr = 0;`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`}`

			`void prepare_to_write_split_index(struct index_state *istate)`
			`{`
			`struct split_index *si = init_split_index(istate);`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`struct cache_entry *entries = NULL, ce;`
			`int i, nr_entries = 0, nr_alloc = 0;`

			`si->delete_bitmap = ewah_new();`
			`si->replace_bitmap = ewah_new();`

			`if (si->base) {`
			`/* Go through istate->cache[] and mark CE_MATCHED to`
			`* entry with positive index. We'll go through`
			`* base->cache[] later to delete all entries in base`
split-index: s/eith/with/ typo fix Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2016-10-23 11:26:30 +02:00			`* that are not marked with either CE_MATCHED or`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`* CE_UPDATE_IN_BASE. If istate->cache[i] is a`
			`* duplicate, deduplicate it.`
			`*/`
			`for (i = 0; i < istate->cache_nr; i++) {`
			`struct cache_entry *base;`
			`/* namelen is checked separately */`
			`const unsigned int ondisk_flags =`
			`CE_STAGEMASK \| CE_VALID \| CE_EXTENDED_FLAGS;`
			`unsigned int ce_flags, base_flags, ret;`
			`ce = istate->cache[i];`
			`if (!ce->index)`
			`continue;`
			`if (ce->index > si->base->cache_nr) {`
			`ce->index = 0;`
			`continue;`
			`}`
			`ce->ce_flags \|= CE_MATCHED; /* or "shared" */`
			`base = si->base->cache[ce->index - 1];`
			`if (ce == base)`
			`continue;`
			`if (ce->ce_namelen != base->ce_namelen \|\|`
			`strcmp(ce->name, base->name)) {`
			`ce->index = 0;`
			`continue;`
			`}`
			`ce_flags = ce->ce_flags;`
			`base_flags = base->ce_flags;`
			`/* only on-disk flags matter */`
			`ce->ce_flags &= ondisk_flags;`
			`base->ce_flags &= ondisk_flags;`
			`ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,`
			`offsetof(struct cache_entry, name) -`
			`offsetof(struct cache_entry, ce_stat_data));`
			`ce->ce_flags = ce_flags;`
			`base->ce_flags = base_flags;`
			`if (ret)`
			`ce->ce_flags \|= CE_UPDATE_IN_BASE;`
block alloc: add lifecycle APIs for cache_entry structs It has been observed that the time spent loading an index with a large number of entries is partly dominated by malloc() calls. This change is in preparation for using memory pools to reduce the number of malloc() calls made to allocate cahce entries when loading an index. Add an API to allocate and discard cache entries, abstracting the details of managing the memory backing the cache entries. This commit does actually change how memory is managed - this will be done in a later commit in the series. This change makes the distinction between cache entries that are associated with an index and cache entries that are not associated with an index. A main use of cache entries is with an index, and we can optimize the memory management around this. We still have other cases where a cache entry is not persisted with an index, and so we need to handle the "transient" use case as well. To keep the congnitive overhead of managing the cache entries, there will only be a single discard function. This means there must be enough information kept with the cache entry so that we know how to discard them. A summary of the main functions in the API is: make_cache_entry: create cache entry for use in an index. Uses specified parameters to populate cache_entry fields. make_empty_cache_entry: Create an empty cache entry for use in an index. Returns cache entry with empty fields. make_transient_cache_entry: create cache entry that is not used in an index. Uses specified parameters to populate cache_entry fields. make_empty_transient_cache_entry: create cache entry that is not used in an index. Returns cache entry with empty fields. discard_cache_entry: A single function that knows how to discard a cache entry regardless of how it was allocated. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:31 +02:00			`discard_cache_entry(base);`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`si->base->cache[ce->index - 1] = ce;`
			`}`
			`for (i = 0; i < si->base->cache_nr; i++) {`
			`ce = si->base->cache[i];`
			`if ((ce->ce_flags & CE_REMOVE) \|\|`
			`!(ce->ce_flags & CE_MATCHED))`
			`ewah_set(si->delete_bitmap, i);`
			`else if (ce->ce_flags & CE_UPDATE_IN_BASE) {`
			`ewah_set(si->replace_bitmap, i);`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00			`ce->ce_flags \|= CE_STRIP_NAME;`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`ALLOC_GROW(entries, nr_entries+1, nr_alloc);`
			`entries[nr_entries++] = ce;`
			`}`
split-index: don't write cache tree with null oid entries In a96d3cc3f6 ("cache-tree: reject entries with null sha1", 2017-04-21) we made sure that broken cache entries do not get propagated to new trees. Part of that was making sure not to re-use an existing cache tree that includes a null oid. It did so by dropping the cache tree in 'do_write_index()' if one of the entries contains a null oid. In split index mode however, there are two invocations to 'do_write_index()', one for the shared index and one for the split index. The cache tree is only written once, to the split index. As we only loop through the elements that are effectively being written by the current invocation, that may not include the entry with a null oid in the split index (when it is already written to the shared index), where we write the cache tree. Therefore in split index mode we may still end up writing the cache tree, even though there is an entry with a null oid in the index. Fix this by checking for null oids in prepare_to_write_split_index, where we loop the entries of the shared index as well as the entries for the split index. This fixes t7009 with GIT_TEST_SPLIT_INDEX. Also add a new test that's more specifically showing the problem. Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-01-07 23:30:14 +01:00			`if (is_null_oid(&ce->oid))`
			`istate->drop_cache_tree = 1;`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`}`
			`}`

			`for (i = 0; i < istate->cache_nr; i++) {`
			`ce = istate->cache[i];`
			`if ((!si->base \|\| !ce->index) && !(ce->ce_flags & CE_REMOVE)) {`
split-index: strip pathname of on-disk replaced entries We know the positions of replaced entries via the replace bitmap in "link" extension, so the "name" path does not have to be stored (it's still in the shared index). With this, we also have a way to distinguish additions vs replacements at load time and can catch broken "link" extensions. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:43 +02:00			`assert(!(ce->ce_flags & CE_STRIP_NAME));`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`ALLOC_GROW(entries, nr_entries+1, nr_alloc);`
			`entries[nr_entries++] = ce;`
			`}`
			`ce->ce_flags &= ~CE_MATCHED;`
			`}`

			`/*`
			`* take cache[] out temporarily, put entries[] in its place`
			`* for writing`
			`*/`
			`si->saved_cache = istate->cache;`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`si->saved_cache_nr = istate->cache_nr;`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00			`istate->cache = entries;`
			`istate->cache_nr = nr_entries;`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`}`

			`void finish_writing_split_index(struct index_state *istate)`
			`{`
			`struct split_index *si = init_split_index(istate);`
split-index: the writing part prepare_to_write_split_index() does the major work, classifying deleted, updated and added entries. write_link_extension() then just writes it down. An observation is, deleting an entry, then adding it back is recorded as "entry X is deleted, entry X is added", not "entry X is replaced". This is simpler, with small overhead: a replaced entry is stored without its path, a new entry is store with its path. A note about unpack_trees() and the deduplication code inside prepare_to_write_split_index(). Usually tracking updated/removed entries via read-cache API is enough. unpack_trees() manipulates the index in a different way: it throws the entire source index out, builds up a new one, copying/duplicating entries (using dup_entry) from the source index over if necessary, then returns the new index. A naive solution would be marking the entire source index "deleted" and add their duplicates as new. That could bring $GIT_DIR/index back to the original size. So we try harder and memcmp() between the original and the duplicate to see if it needs updating. We could avoid memcmp() too, by avoiding duplicating the original entry in dup_entry(). The performance gain this way is within noise level and it complicates unpack-trees.c. So memcmp() is the preferred way to deal with deduplication. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:40 +02:00
			`ewah_free(si->delete_bitmap);`
			`ewah_free(si->replace_bitmap);`
			`si->delete_bitmap = NULL;`
			`si->replace_bitmap = NULL;`
			`free(istate->cache);`
			`istate->cache = si->saved_cache;`
read-cache: split-index mode This split-index mode is designed to keep write cost proportional to the number of changes the user has made, not the size of the work tree. (Read cost is another matter, to be dealt separately.) This mode stores index info in a pair of $GIT_DIR/index and $GIT_DIR/sharedindex.<SHA-1>. sharedindex is large and unchanged over time while "index" is smaller and updated often. Format details are in index-format.txt, although not everything is implemented in this patch. Shared indexes are not automatically removed, because it's unclear if the shared index is needed by any (even temporary) indexes by just looking at it. After a while you'll collect stale shared indexes. The good news is one shared index is useable for long, until $GIT_DIR/index becomes too big and sluggish that the new shared index must be created. The safest way to clean shared indexes is to turn off split index mode, so shared files are all garbage, delete them all, then turn on split index mode again. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:36 +02:00			`istate->cache_nr = si->saved_cache_nr;`
			`}`

			`void discard_split_index(struct index_state *istate)`
			`{`
			`struct split_index *si = istate->split_index;`
			`if (!si)`
			`return;`
			`istate->split_index = NULL;`
			`si->refcount--;`
			`if (si->refcount)`
			`return;`
			`if (si->base) {`
			`discard_index(si->base);`
			`free(si->base);`
			`}`
			`free(si);`
			`}`
read-cache: save deleted entries in split index Entries that belong to the base index should not be freed. Mark CE_REMOVE to track them. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:38 +02:00
			`void save_or_free_index_entry(struct index_state istate, struct cache_entry ce)`
			`{`
			`if (ce->index &&`
			`istate->split_index &&`
			`istate->split_index->base &&`
			`ce->index <= istate->split_index->base->cache_nr &&`
			`ce == istate->split_index->base->cache[ce->index - 1])`
			`ce->ce_flags \|= CE_REMOVE;`
			`else`
block alloc: add lifecycle APIs for cache_entry structs It has been observed that the time spent loading an index with a large number of entries is partly dominated by malloc() calls. This change is in preparation for using memory pools to reduce the number of malloc() calls made to allocate cahce entries when loading an index. Add an API to allocate and discard cache entries, abstracting the details of managing the memory backing the cache entries. This commit does actually change how memory is managed - this will be done in a later commit in the series. This change makes the distinction between cache entries that are associated with an index and cache entries that are not associated with an index. A main use of cache entries is with an index, and we can optimize the memory management around this. We still have other cases where a cache entry is not persisted with an index, and so we need to handle the "transient" use case as well. To keep the congnitive overhead of managing the cache entries, there will only be a single discard function. This means there must be enough information kept with the cache entry so that we know how to discard them. A summary of the main functions in the API is: make_cache_entry: create cache entry for use in an index. Uses specified parameters to populate cache_entry fields. make_empty_cache_entry: Create an empty cache entry for use in an index. Returns cache entry with empty fields. make_transient_cache_entry: create cache entry that is not used in an index. Uses specified parameters to populate cache_entry fields. make_empty_transient_cache_entry: create cache entry that is not used in an index. Returns cache entry with empty fields. discard_cache_entry: A single function that knows how to discard a cache entry regardless of how it was allocated. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:31 +02:00			`discard_cache_entry(ce);`
read-cache: save deleted entries in split index Entries that belong to the base index should not be freed. Mark CE_REMOVE to track them. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:38 +02:00			`}`
read-cache: mark updated entries for split index The large part of this patch just follows CE_ENTRY_CHANGED marks. replace_index_entry() is updated to update split_index->base->cache[] as well so base->cache[] does not reference to a freed entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:39 +02:00
			`void replace_index_entry_in_base(struct index_state *istate,`
split-index: rename 'new' variables Rename C++ keyword in order to bring the codebase closer to being able to be compiled with a C++ compiler. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-02-14 19:59:48 +01:00			`struct cache_entry *old_entry,`
			`struct cache_entry *new_entry)`
read-cache: mark updated entries for split index The large part of this patch just follows CE_ENTRY_CHANGED marks. replace_index_entry() is updated to update split_index->base->cache[] as well so base->cache[] does not reference to a freed entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:39 +02:00			`{`
split-index: rename 'new' variables Rename C++ keyword in order to bring the codebase closer to being able to be compiled with a C++ compiler. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-02-14 19:59:48 +01:00			`if (old_entry->index &&`
read-cache: mark updated entries for split index The large part of this patch just follows CE_ENTRY_CHANGED marks. replace_index_entry() is updated to update split_index->base->cache[] as well so base->cache[] does not reference to a freed entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:39 +02:00			`istate->split_index &&`
			`istate->split_index->base &&`
split-index: rename 'new' variables Rename C++ keyword in order to bring the codebase closer to being able to be compiled with a C++ compiler. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-02-14 19:59:48 +01:00			`old_entry->index <= istate->split_index->base->cache_nr) {`
			`new_entry->index = old_entry->index;`
			`if (old_entry != istate->split_index->base->cache[new_entry->index - 1])`
block alloc: add lifecycle APIs for cache_entry structs It has been observed that the time spent loading an index with a large number of entries is partly dominated by malloc() calls. This change is in preparation for using memory pools to reduce the number of malloc() calls made to allocate cahce entries when loading an index. Add an API to allocate and discard cache entries, abstracting the details of managing the memory backing the cache entries. This commit does actually change how memory is managed - this will be done in a later commit in the series. This change makes the distinction between cache entries that are associated with an index and cache entries that are not associated with an index. A main use of cache entries is with an index, and we can optimize the memory management around this. We still have other cases where a cache entry is not persisted with an index, and so we need to handle the "transient" use case as well. To keep the congnitive overhead of managing the cache entries, there will only be a single discard function. This means there must be enough information kept with the cache entry so that we know how to discard them. A summary of the main functions in the API is: make_cache_entry: create cache entry for use in an index. Uses specified parameters to populate cache_entry fields. make_empty_cache_entry: Create an empty cache entry for use in an index. Returns cache entry with empty fields. make_transient_cache_entry: create cache entry that is not used in an index. Uses specified parameters to populate cache_entry fields. make_empty_transient_cache_entry: create cache entry that is not used in an index. Returns cache entry with empty fields. discard_cache_entry: A single function that knows how to discard a cache entry regardless of how it was allocated. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:31 +02:00			`discard_cache_entry(istate->split_index->base->cache[new_entry->index - 1]);`
split-index: rename 'new' variables Rename C++ keyword in order to bring the codebase closer to being able to be compiled with a C++ compiler. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-02-14 19:59:48 +01:00			`istate->split_index->base->cache[new_entry->index - 1] = new_entry;`
read-cache: mark updated entries for split index The large part of this patch just follows CE_ENTRY_CHANGED marks. replace_index_entry() is updated to update split_index->base->cache[] as well so base->cache[] does not reference to a freed entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2014-06-13 14:19:39 +02:00			`}`
			`}`
split-index: add {add,remove}_split_index() functions Also use the functions in cmd_update_index() in builtin/update-index.c. These functions will be used in a following commit to tweak our use of the split-index feature depending on the setting of a configuration variable. Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2017-02-27 19:00:01 +01:00
			`void add_split_index(struct index_state *istate)`
			`{`
			`if (!istate->split_index) {`
			`init_split_index(istate);`
			`istate->cache_changed \|= SPLIT_INDEX_ORDERED;`
			`}`
			`}`

			`void remove_split_index(struct index_state *istate)`
			`{`
Revert "split-index: add and use unshare_split_index()" This reverts commit f9d7abec2ad2f9eb3d8873169cc28c34273df082; see public-inbox.org/git/CAP8UFD0bOfzY-_hBDKddOcJdPUpP2KEVaX_SrCgvAMYAHtseiQ@mail.gmail.com 2017-06-24 21:02:39 +02:00			`if (istate->split_index) {`
			`/*`
block alloc: allocate cache entries from mem_pool When reading large indexes from disk, a portion of the time is dominated in malloc() calls. This can be mitigated by allocating a large block of memory and manage it ourselves via memory pools. This change moves the cache entry allocation to be on top of memory pools. Design: The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated from. When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be. When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for. In the case of a Split Index, the following rules are followed. 1st, some terminology is defined: Terminology: - 'the_index': represents the logical view of the index - 'split_index': represents the "base" cache entries. Read from the split index file. 'the_index' can reference a single split_index, as well as cache_entries from the split_index. `the_index` will be discarded before the `split_index` is. This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the `split_index`'s memory pool. This allows us to follow the pattern that `the_index` can reference cache_entries from the `split_index`, and that the cache_entries will not be freed while they are still being referenced. Managing transient cache_entry structs: Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with. Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently. Several strategies were contemplated around how to handle this: Chosen approach: An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not. This is currently an int field, as there are no more available bits in the existing ce_flags bit field. If / when more bits are needed, this new field can be turned into a proper bit field. Alternatives: 1) Do not include any information about how the cache_entry was allocated. Calling code would be responsible for tracking whether the cache_entry needed to be freed or not. Pro: No extra memory overhead to track this state Con: Extra complexity in callers to handle this correctly. The extra complexity and burden to not regress this behavior in the future was more than we wanted. 2) cache_entry would gain knowledge about which mem_pool allocated it Pro: Could (potentially) do extra logic to know when a mem_pool no longer had references to any cache_entry Con: cache_entry would grow heavier by a pointer, instead of int We didn't see a tangible benefit to this approach 3) Do not add any extra information to a cache_entry, but when freeing a cache entry, check if the memory exists in a region managed by existing mem_pools. Pro: No extra memory overhead to track state Con: Extra computation is performed when freeing cache entries We decided tracking and iterating over known memory pool regions was less desirable than adding an extra field to track this stae. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:37 +02:00			`* When removing the split index, we need to move`
			`* ownership of the mem_pool associated with the`
			`* base index to the main index. There may be cache entries`
			`* allocated from the base's memory pool that are shared with`
			`* the_index.cache[].`
Revert "split-index: add and use unshare_split_index()" This reverts commit f9d7abec2ad2f9eb3d8873169cc28c34273df082; see public-inbox.org/git/CAP8UFD0bOfzY-_hBDKddOcJdPUpP2KEVaX_SrCgvAMYAHtseiQ@mail.gmail.com 2017-06-24 21:02:39 +02:00			`*/`
block alloc: allocate cache entries from mem_pool When reading large indexes from disk, a portion of the time is dominated in malloc() calls. This can be mitigated by allocating a large block of memory and manage it ourselves via memory pools. This change moves the cache entry allocation to be on top of memory pools. Design: The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated from. When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be. When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for. In the case of a Split Index, the following rules are followed. 1st, some terminology is defined: Terminology: - 'the_index': represents the logical view of the index - 'split_index': represents the "base" cache entries. Read from the split index file. 'the_index' can reference a single split_index, as well as cache_entries from the split_index. `the_index` will be discarded before the `split_index` is. This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the `split_index`'s memory pool. This allows us to follow the pattern that `the_index` can reference cache_entries from the `split_index`, and that the cache_entries will not be freed while they are still being referenced. Managing transient cache_entry structs: Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with. Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently. Several strategies were contemplated around how to handle this: Chosen approach: An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not. This is currently an int field, as there are no more available bits in the existing ce_flags bit field. If / when more bits are needed, this new field can be turned into a proper bit field. Alternatives: 1) Do not include any information about how the cache_entry was allocated. Calling code would be responsible for tracking whether the cache_entry needed to be freed or not. Pro: No extra memory overhead to track this state Con: Extra complexity in callers to handle this correctly. The extra complexity and burden to not regress this behavior in the future was more than we wanted. 2) cache_entry would gain knowledge about which mem_pool allocated it Pro: Could (potentially) do extra logic to know when a mem_pool no longer had references to any cache_entry Con: cache_entry would grow heavier by a pointer, instead of int We didn't see a tangible benefit to this approach 3) Do not add any extra information to a cache_entry, but when freeing a cache entry, check if the memory exists in a region managed by existing mem_pools. Pro: No extra memory overhead to track state Con: Extra computation is performed when freeing cache entries We decided tracking and iterating over known memory pool regions was less desirable than adding an extra field to track this stae. Signed-off-by: Jameson Miller <jamill@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2018-07-02 21:49:37 +02:00			`mem_pool_combine(istate->ce_mem_pool, istate->split_index->base->ce_mem_pool);`

			`/*`
			`* The split index no longer owns the mem_pool backing`
			`* its cache array. As we are discarding this index,`
			`* mark the index as having no cache entries, so it`
			`* will not attempt to clean up the cache entries or`
			`* validate them.`
			`*/`
			`if (istate->split_index->base)`
			`istate->split_index->base->cache_nr = 0;`

			`/*`
			`* We can discard the split index because its`
			`* memory pool has been incorporated into the`
			`* memory pool associated with the the_index.`
			`*/`
			`discard_split_index(istate);`

Revert "split-index: add and use unshare_split_index()" This reverts commit f9d7abec2ad2f9eb3d8873169cc28c34273df082; see public-inbox.org/git/CAP8UFD0bOfzY-_hBDKddOcJdPUpP2KEVaX_SrCgvAMYAHtseiQ@mail.gmail.com 2017-06-24 21:02:39 +02:00			`istate->cache_changed \|= SOMETHING_CHANGED;`
			`}`
split-index: add {add,remove}_split_index() functions Also use the functions in cmd_update_index() in builtin/update-index.c. These functions will be used in a following commit to tweak our use of the split-index feature depending on the setting of a configuration variable. Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com> 2017-02-27 19:00:01 +01:00			`}`