* precisely one legal location at which a given new item can be inserted
* into a bucket.
*
- * For good concurrency, it seems essential to avoid locking buckets
- * while they are being scanned. Taking even a shared LWLock or similar
- * still means acquiring and releasing a spinlock, with is both
- * inefficient in terms of raw cycles and a potential contention point.
- * Thus, we decree that readers must be able to scan bucket chains without
- * executing any atomic operations either before, during, or after the
- * scan. Writers necessarily require some locking; for now, each bucket
- * has a separate spinlock which must be taken to modify that bucket chain,
- * but not when reading it. In the future, we might further adapt this
- * code to instead use compare-and-swap where available.
+ * Items are inserted into buckets using compare-and-swap (CAS). Thus, this
+ * module cannot be used on architectures where we do not have 4-byte
+ * compare-and-swap. When an item is deleted, we first set its mark bit,
+ * which is stored within the next-pointer, again using CAS. Once this
+ * step is completed, the node is logically deleted. As a cleanup operation,
+ * we then use CAS to modify the next-pointer of the previous node to point
+ * to the node following the deleted node. Note that, even once this cleanup
+ * operation has been performed, some other backend concurrently walking the
+ * chain might still be visiting the deleted node. Thus, we must be certain
+ * not to overwrite the deleted node's next-pointer until all concurrent
+ * bucket scans have completed. This means, in particular, that we cannot
+ * immediately view the deleted node as available for reuse.
*
- * Even after an entity has been deleted from a bucket chain, it is still
- * possible that some other backend holds a pointer to it from a bucket
- * chain traversal which began before the deletion was carried out.
- * Thus, we cannot recycle the block of memory used by an entity for a
- * new and unrelated entity until we can guarantee that no private
- * references to it remain. Instead, we add the entity to one of several
- * "garbage lists" of items removed from bucket chains that are not yet
- * known to be recyclable. Periodically, we move items from garbage lists
- * to free lists from which they can be reallocated. This is accomplished
- * by having each backend which wishes to scan a bucket store the hash
- * table id and bucket identifier in a per-backend slot in shared memory
- * before it begins scanning the bucket and clear the value only after it
- * finishes scanning the bucket, so that it is possible for another
- * backend to wait (by spinning) for all backends in the process of
- * scanning a bucket to finish doing so. To make sure we don't need to
- * garbage-collect too often, we allocate a slightly larger arena than
- * the caller's stated maximum size.
+ * Instead, when a delete-marked node is removed from the bucket chain, it
+ * is added to one of many garbage lists. There is a many-to-one mapping from
+ * buckets to garbage lists, so that the choice of bucket determines the
+ * garbage list but not vice versa. Any process which wishes to scan a bucket
+ * must first advertise in shared memory the corresponding garbage list number.
+ * When a backend wishes to move entries from a garbage list to a free list,
+ * it must first wait (by spinning) for any backends scanning that garbage
+ * list to complete their scans.
+ *
+ * It would be nice to make this completely lock-free, but because
+ * of the above-described choice of garbage collection algorithm, it currently
+ * isn't. For an algorithm to be lock-free, it must be possible to suspend
+ * the execution of any number of processes for an arbitrary period of time
+ * without impeding the overall progress of the system. For this code, that
+ * is true except when garbage collection occurs. In that case, an insert,
+ * search, or delete operation can obstruct an inserting thread attempting to
+ * perform garbage collection for an unbounded period of time. The algorithm
+ * could be adapted so as to be completely lock-free, essentially by guaranteeing
+ * that even in the worst case no combination of processes can obstruct garbage
+ * collection to a sufficient degree as to prevent an inserter from finding
+ * an available entry in a hash table containing fewer live elements than its
+ * declared maximum capacity. However, it's not clear that this is a good
+ * idea, because it would likely slow down operation in the non-contended
+ * case to solve a problem that we hope won't happen anyway.
*
* Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
#include "storage/barrier.h"
#include "storage/proc.h"
#include "storage/shmem.h"
-#include "storage/spin.h"
#include "utils/chash.h"
#include "utils/memutils.h"
#define MakeCHashPtr(x) ((x) << 1)
#define CHashMaxCapacity CHashPtrGetOffset(InvalidCHashPtr)
-/*
- * CHashBucket represents a single hash bucket, garbage list, or free list.
- */
-typedef struct
-{
- CHashPtr head; /* arena offset of bucket head */
-} CHashBucket;
-
/*
* Each object stored in the hash table is represented by a CHashNode, which
* stores a pointer to the next item in the same bucket, and the exact hash
* overallocate by one-eighth, but if that would be less than 15 elements,
* then we allocate 15 elements instead. This extra capacity can actually
* be used, but for best performance, it shouldn't be. It's the caller's
- * responsibility to avoid this where relevant.
+ * responsibility to avoid this.
*/
table->arena_limit = desc->capacity;
if (desc->capacity < 120)
* will insert the entry and return true. Otherwise, the duplicate entry's
* value will be copied into the supplied entry and the function will return
* false.
+ *
+ * The caller is responsible for ensuring that no inserts are performed into
+ * a hash table which is at capacity. The behavior of such an insert is
+ * undefined (the actual behavior is that the insert may either succeed,
+ * degrading performance; or CHashAllocate may enter a tight loop until such
+ * time as an element is deleted).
*/
bool
CHashInsert(CHashTable table, void *entry)