Update comments.

author Robert Haas <[email protected]>

Thu, 26 Jul 2012 19:24:19 +0000 (15:24 -0400)

committer Robert Haas <[email protected]>

Thu, 26 Jul 2012 19:24:19 +0000 (15:24 -0400)
author Robert Haas <[email protected]>
Thu, 26 Jul 2012 19:24:19 +0000 (15:24 -0400)
committer Robert Haas <[email protected]>
Thu, 26 Jul 2012 19:24:19 +0000 (15:24 -0400)
diff --git a/src/backend/utils/hash/chash.c b/src/backend/utils/hash/chash.c

index 9e0ed3a120abc4ff2643fd0810dc0ee7fff6158c..be409f1749a9141e8ffa2d6597403ce9da7ac376 100644 (file)
--- a/src/backend/utils/hash/chash.c
+++ b/src/backend/utils/hash/chash.c
@@ -42,34 +42,43 @@
   * precisely one legal location at which a given new item can be inserted
   * into a bucket.
   *
- * For good concurrency, it seems essential to avoid locking buckets
- * while they are being scanned.  Taking even a shared LWLock or similar
- * still means acquiring and releasing a spinlock, with is both
- * inefficient in terms of raw cycles and a potential contention point.
- * Thus, we decree that readers must be able to scan bucket chains without
- * executing any atomic operations either before, during, or after the
- * scan.  Writers necessarily require some locking; for now, each bucket
- * has a separate spinlock which must be taken to modify that bucket chain,
- * but not when reading it.  In the future, we might further adapt this
- * code to instead use compare-and-swap where available.
+ * Items are inserted into buckets using compare-and-swap (CAS).  Thus, this
+ * module cannot be used on architectures where we do not have 4-byte
+ * compare-and-swap.  When an item is deleted, we first set its mark bit,
+ * which is stored within the next-pointer, again using CAS.  Once this
+ * step is completed, the node is deleted.  As a cleanup operation, we then
+ * use CAS to modify the next-pointer of the previous node to point to the
+ * node following the deleted node.  Note that, even once this cleanup
+ * operation has been performed, some other backend concurrently walking the
+ * chain might still be visiting the deleted node.  Thus, we must be certain
+ * not to overwrite the deleted node's next-pointer until all concurrent
+ * bucket scans have completed.  This means, in particular, that we cannot
+ * immediately view the deleted node as available for reuse.
   *
- * Even after an entity has been deleted from a bucket chain, it is still
- * possible that some other backend holds a pointer to it from a bucket
- * chain traversal which began before the deletion was carried out.
- * Thus, we cannot recycle the block of memory used by an entity for a
- * new and unrelated entity until we can guarantee that no private
- * references to it remain.  Instead, we add the entity to one of several
- * "garbage lists" of items removed from bucket chains that are not yet
- * known to be recyclable.  Periodically, we move items from garbage lists
- * to free lists from which they can be reallocated.  This is accomplished
- * by having each backend which wishes to scan a bucket store the hash
- * table id and bucket identifier in a per-backend slot in shared memory
- * before it begins scanning the bucket and clear the value only after it
- * finishes scanning the bucket, so that it is possible for another
- * backend to wait (by spinning) for all backends in the process of
- * scanning a bucket to finish doing so.  To make sure we don't need to
- * garbage-collect too often, we allocate a slightly larger arena than
- * the caller's stated maximum size.
+ * Instead, when a delete-marked node is removed from the bucket chain, it
+ * is added to one of many garbage lists.  There is a many-to-one mapping from
+ * buckets to garbage lists, so that the choice of bucket determines the
+ * garbage list but not visca versa.  Any process which wishes to scan a bucket
+ * must first advertise in shared memory the corresponding garbage list number.
+ * When a backend wishes to move entries from a garbage list to a free list,
+ * it must first wait (by spinning) for any backends scanning that garbage
+ * list to complete their scans.
+ *
+ * Ideally, it would be nice to make this completely lock-free, but because
+ * of the above-described choice of garbage collection algorithm, it currently
+ * isn't.  For an algorithm to be lock-free, it must be possible to suspend
+ * the execution of any number of processes for an arbitrary period of time
+ * without impeding the overall progress of the system.  For this code, that
+ * is true except when garbage collection occurs.  In that case, an insert,
+ * search, or delete operation can obstruct an inserting thread attempting to
+ * perform garbage collection for an unbounded period of time.  The algorithm
+ * could be adapted as to be completely lock-free, essentially by guaranteeing
+ * that even in the worst case no combination of processes can obstruct garbage
+ * collection to a sufficient degree as to prevent an inserter from finding
+ * an available entry in a hash table containing fewer live elements than its
+ * declared maximum capacity.  However, it's not clear that this is a good
+ * idea, because it would likely slow down operation in the non-contended
+ * case to solve a problem that we hope won't happen anyway.
   *
   * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
@@ -87,7 +96,6 @@
  #include "storage/barrier.h"
  #include "storage/proc.h"
  #include "storage/shmem.h"
-#include "storage/spin.h"
  #include "utils/chash.h"
  #include "utils/memutils.h"
  
@@ -104,14 +112,6 @@ typedef uint32 CHashPtr;
  #define MakeCHashPtr(x)                                ((x) << 1)
  #define CHashMaxCapacity                       CHashPtrGetOffset(InvalidCHashPtr)
  
-/*
- * CHashBucket represents a single hash bucket, garbage list, or free list.
- */
-typedef struct
-{
-       CHashPtr        head;                           /* arena offset of bucket head */
-} CHashBucket;
-
  /*
   * Each object stored in the hash table is represented by a CHashNode, which
   * stores a pointer to the next item in the same bucket, and the exact hash
@@ -245,7 +245,7 @@ CHashBootstrap(CHashDescriptor *desc)
          * overallocate by one-eighth, but if that would be less than 15 elements,
          * then we allocate 15 elements instead.  This extra capacity can actually
          * be used, but for best performance, it shouldn't be.  It's the caller's
-        * responsibility to avoid this where relevant.
+        * responsibility to avoid this.
          */
         table->arena_limit = desc->capacity;
         if (desc->capacity < 120)
@@ -428,6 +428,12 @@ CHashSearch(CHashTable table, void *entry)
   * will insert the entry and return true.  Otherwise, the duplicate entry's
   * value will be copied into the supplied entry and the function will return
   * false.
+ *
+ * The caller is responsible for ensuring that no inserts are performed into
+ * a hash table which is at capacity.  The behavor of such an insert is
+ * undefined (the actual behavior is that the insert may either succeed,
+ * degrading performance; or CHashAllocate may enter a tight loop until such
+ * time as an element is deleted).
   */
  bool
  CHashInsert(CHashTable table, void *entry)
author	Robert Haas <[email protected]>
	Thu, 26 Jul 2012 19:24:19 +0000 (15:24 -0400)
committer	Robert Haas <[email protected]>
	Thu, 26 Jul 2012 19:24:19 +0000 (15:24 -0400)