RCU Judy Array Design
Mathieu Desnoyers
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using a Judy array (compressed nodes) for an RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than in, e.g.,
  non-RCU red-black trees.

Choice: using 256-entry intermediate nodes (the index can be represented
on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit.


* Node types (from least dense node to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Addition/removal just needs to update the value and pointer arrays,
  single-entry (non-RCU...). For RCU, we might need to update the entire
  node anyway.
- Requires a sequential search through the entire value array to detect
  lookup failure.

Filled at 3 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 bytes = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail & success: 1 cache line.

Filled at 6 entries max 32-bit, 7 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4 bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 bytes = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail & success: 1 cache line.

Filled at 12 entries max 32-bit, 14 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4 bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 bytes = 128 bytes

Filled at 25 entries max 32-bit, 28 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 bytes = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in the values array fits in a 32-byte cache line.
  - lookup failure: 1 cache line.
  - lookup success: 2 cache lines.
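To make the Type A layout concrete, here is a minimal C sketch of the
7-entry 64-bit configuration and its sequential lookup. The struct and
function names are illustrative only (not the actual rcuja API), and
struct ja_node stands in for the generic node type.

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

struct ja_node;                 /* placeholder for the generic node type */

struct ja_linear_node_7 {       /* 64-bit, up to 7 children: 64 bytes total */
        uint8_t nr_child;       /* 8 bits indicating number of children */
        uint8_t values[7];      /* 8-bit key values, scanned sequentially */
        struct ja_node *children[7];    /* pointers matched 1:1 with values[] */
};

static struct ja_node *
ja_linear_lookup(struct ja_linear_node_7 *node, uint8_t n)
{
        unsigned int i;

        /*
         * The count byte and values[] fit within the first 32-byte cache
         * line, so a lookup failure touches a single cache line; reading
         * the matching pointer may touch a second one.
         */
        for (i = 0; i < node->nr_child; i++) {
                if (node->values[i] == n)
                        return rcu_dereference(node->children[i]);
        }
        return NULL;            /* no child for this 8-bit index */
}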
The two below are listed for completeness' sake, but because they
require 2 32-byte cache lines for lookup, they are deemed inappropriate.

Filled at 51 entries max 32-bit, 56 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4 bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 bytes = 512 bytes

Filled at 102 entries max 32-bit, 113 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4 bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 bytes = 1024 bytes


- Type B: pools of value and pointer arrays

Pools of value and pointer arrays. Each pool's value array fits within
32 bytes (so it fits in an L1 cache line). Each pool begins with an
8-bit integer, which is the number of children in this pool, followed by
an array of 8-bit values, padding, and an array of pointers. Value and
pointer arrays are associated as in Type A.

The entries of a node are associated with their respective pool based on
their index value.

+ Allows lookup failure to touch only one 32-byte cache line.
  lookup success: 2 cache lines.

+ Allows in-place updates without reallocation, except when a pool is
  full. (This was not possible with bitmap-based nodes.)
- If one pool exhausts its space, we need to increase the node size.
  Therefore, for very dense populations, we will end up using the
  pigeon-hole node type sooner, thus consuming more space.

Pool configuration:

Per pool, filled at 25 entries (32-bit), 28 entries (64-bit)
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 bytes = 256 bytes

Total up to 50 entries (32-bit), 56 entries (64-bit)
2 pools: 32-bit = 256 bytes
2 pools: 64-bit = 512 bytes

Total up to 100 entries (32-bit), 112 entries (64-bit)
4 pools: 32-bit = 512 bytes
4 pools: 64-bit = 1024 bytes


* Choice of pool configuration distribution:

We have pools of either 2 or 4 linear arrays. Their total size is
between 256 bytes (32-bit, 2 arrays) and 1024 bytes (64-bit, 4 arrays).

Alignment on 256 bytes means that the 8 least-significant bits of the
pointers are free. Given that the type selection already uses 3 bits, we
have 5 bits left.

Alignment on 512 bytes -> 6 bits left.

We can therefore encode which bit, or which two bits, are used for
distribution selection. We can use this technique to rebalance pools if
they become unbalanced (e.g., when all children fall within one of the
two linear arrays).

We assume that finding the exact maximum sub-pool usage for any given
distribution is NP-complete (not proven).

We take sub-class size unbalance into account. It was evaluated
programmatically over millions of runs: randomly take N entries from
256, compute the distribution for each bit (number of entries for which
that bit is one/zero), compute the resulting unbalance for each bit, and
choose the bit with the minimum unbalance.

We start from the "ideal" population size (with no unbalance) and
iterate to a fixed point to find the appropriate population size.

tot entries    subclass extra items   largest linear array (stat. approx.)
--------------------------------------------------------------------------
48 entries:    1 (98%)                24+1=25  (target ~50/2=25)
54 entries:    1 (97%)                27+1=28  (target ~56/2=28)

Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback for when
the subclass does not fit, but that fallback does not need to be
efficient.
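As an illustration of the single-bit (2-pool) case, here is a small
hypothetical helper (not part of the rcuja code) that picks the
distribution bit giving the most even split of a node's 8-bit index
values; the chosen bit index would then be recorded in the spare
low-order bits of the parent pointer.

#include <stdint.h>
#include <limits.h>

/*
 * Return the bit (0..7) whose one/zero split over the given 8-bit index
 * values is closest to half/half.  Entries whose index has that bit set
 * would go to one linear array, the others to the other array.
 */
static unsigned int ja_pool2_best_distribution_bit(const uint8_t *values,
                unsigned int nr_values)
{
        unsigned int bit, best_bit = 0, best_unbalance = UINT_MAX;

        for (bit = 0; bit < 8; bit++) {
                unsigned int i, nr_one = 0, nr_zero, unbalance;

                for (i = 0; i < nr_values; i++)
                        nr_one += (values[i] >> bit) & 1;
                nr_zero = nr_values - nr_one;
                unbalance = nr_one > nr_zero ?
                        nr_one - nr_zero : nr_zero - nr_one;
                if (unbalance < best_unbalance) {
                        best_unbalance = unbalance;
                        best_bit = bit;
                }
        }
        return best_bit;
}

This mirrors the statistical experiment above: with 48 random entries,
the best bit almost always leaves at most 25 entries in the larger of
the two linear arrays.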
For pools of size 4, we need to approximate the maximum unbalance we can
get when the distribution is chosen by pairs of bits.

tot entries    subclass extra items   largest linear array (stat. approx.)
--------------------------------------------------------------------------
92 entries:    2 (99%)                23+2=25  (target ~100/4=25)
104 entries:   2 (99%)                26+2=28  (target ~112/4=28)

Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback for when
the subclass does not fit, but that fallback does not need to be
efficient.


* Population "does not fit" and distribution fallback

When adding a child to a distribution node, if the child does not fit,
we recalculate the best distribution. If it does not fit in that
distribution either, we need to expand the node type.

When removing a child, if the node child count is brought down to the
number of entries expected to statistically fit in the lower-order node,
we try to shrink. However, if we notice that the distribution does not
actually fit in that shrunk node, we abort the shrink operation. If the
shrink fails, we keep a counter of insertion/removal operations on the
node before we allow the shrink to be attempted again.


- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121 entries+, 64-bit: 125 entries+)
Array of child node pointers. A pointer is NULL if there is no child at
that index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes


* Analysis of the thresholds:

Analysis of the number of cache lines touched for each node, per node
type, depending on the number of children per node, as the number of
children goes from 0 to 256. From this, we choose the number-of-children
thresholds at which it is worthwhile to switch to a different node type.

Per node:

- ALWAYS 1 cache-line hit for lookup failure (all cases)

32-bit

- Empty node

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache-line hit for lookup success
- 32 bytes storage

up to 6 children

- 2 cache-line hits for lookup success
- 64 bytes storage

up to 12 children

- 128 bytes storage

up to 25 children

- Type B: pool

- 256 bytes storage

up to 50 children

- 512 bytes storage

up to 100 children

- Type C: pigeon-hole array
- 1 cache-line hit for lookup success
- 1024 bytes storage

up to 256 children


64-bit

- Empty node

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache-line hit for lookup success
- 32 bytes storage

up to 3 children

- 2 cache-line hits for lookup success
- 64 bytes storage

up to 7 children

- 128 bytes storage

up to 14 children

- 256 bytes storage

up to 28 children

- Type B: pool

- 512 bytes storage

up to 56 children

- 1024 bytes storage

up to 112 children

- Type C: pigeon-hole array
- 1 cache-line hit for lookup success
- 2048 bytes storage

up to 256 children


* Analysis of node type encoding and node pointers:

Lookups are _always_ from the top of the tree going down. This
facilitates RCU replacement, as we only keep track of pointers going
downward.

The type of a node is encoded in the parent's pointer to it. We need to
reserve the 2 least-significant bits for this purpose.
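A minimal sketch of how those 2 bits could be packed into and extracted
from a child pointer (names are illustrative; the type values correspond
to the enum below; nodes are assumed to be at least 4-byte aligned,
which their stated alignment guarantees):

#include <stdint.h>

#define JA_TYPE_MASK    ((uintptr_t) 0x3)       /* 2 least-significant bits */

/* Tag a (sufficiently aligned) node pointer with its type. */
static inline void *ja_node_flag(void *node, unsigned int type)
{
        return (void *) ((uintptr_t) node | type);
}

/* Extract the untagged node address. */
static inline void *ja_node_ptr(void *tagged)
{
        return (void *) ((uintptr_t) tagged & ~JA_TYPE_MASK);
}

/* Extract the node type from the 2 low-order bits. */
static inline unsigned int ja_node_type(void *tagged)
{
        return (unsigned int) ((uintptr_t) tagged & JA_TYPE_MASK);
}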
Types of children:

enum child_type {
	RCU_JA_LINEAR = 0,	/* Type A */
			/* 32-bit: 1 to 25 children, 8 to 128 bytes */
			/* 64-bit: 1 to 28 children, 16 to 256 bytes */
	RCU_JA_POOL = 1,	/* Type B */
			/* 32-bit: 26 to 100 children, 256 to 512 bytes */
			/* 64-bit: 29 to 112 children, 512 to 1024 bytes */
	RCU_JA_PIGEON = 2,	/* Type C */
			/* 32-bit: 101 to 256 children, 1024 bytes */
			/* 64-bit: 113 to 256 children, 2048 bytes */
	/* Leaf nodes are implicit from their height in the tree. */
};

If the entire pointer is NULL, the child is empty.


* Lookup and Update Algorithms

Let's propose a fairly simple scheme that uses a mutex on nodes to
manage update concurrency. It is certainly not optimal in terms of
concurrency within a node, but it has the advantage of being simple to
implement and understand.

We need to keep a count of the number of children for each node, to
track when the node type thresholds are reached. It is important to add
hysteresis so we don't switch between node types too often when the same
entry is repeatedly added and removed.

We acquire locks from child to parent, nested. We take all the locks
required to perform a given update in the tree (but no more) to keep it
consistent with respect to the number of children per node.

If the check for a node being gc'd (always performed under the node
lock) shows that the node is being reclaimed, we simply release the lock
and look up the node again.


- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree until we reach the leaf node. If a
level is not populated, fail.

rcu_read_unlock()


- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup the insert position. Find the location in the tree where
nodes are missing for this insertion. If the leaf is already present,
the insert fails, releasing the RCU read lock. The insert location
consists of a parent node to which we want to attach a new node.

B) Lock

RCU-lookup the parent node. Take the parent lock. If the parent needs to
be reallocated to make room for this insertion, RCU-lookup the
parent-parent node and take the parent-parent lock. For each lock taken,
check if the node is being gc'd. If gc'd, release the lock,
re-RCU-lookup this node, and retry.

C) Create

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting.

D) Populate:
   - If parent node reallocation is needed:
     Reallocate the parent node, adding the new branch to it, and
     increment its node count.
     Publish the new parent node with rcu_assign_pointer.
     Set the gc flag in the old nodes.
     call_rcu free for all old nodes.
     (See the sketch after this procedure.)
   - Else:
     Increment the parent node count.
     Use rcu_assign_pointer to populate this new branch into the parent
     node.

E) Locks

Release the parent and (if taken) parent-parent locks.
rcu_read_unlock()
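The reallocation branch of step D can be sketched as follows. This is a
hedged illustration under assumed, grossly simplified types (a struct
ja_node with a flat children array, a gc flag, and an embedded rcu_head);
the helper names are not the actual rcuja API. The point is the
ordering: publish the replacement with rcu_assign_pointer first, mark
the old node, then defer its freeing with call_rcu.

#include <stdint.h>
#include <stdlib.h>
#include <urcu.h>               /* rcu_assign_pointer(), call_rcu() */
#include <urcu/compiler.h>      /* caa_container_of() */

struct ja_node {                /* simplified for illustration only */
        struct ja_node *children[256];
        unsigned int nr_child;
        int gc_flag;            /* checked by concurrent updaters under lock */
        struct rcu_head rcu_head;
};

static void ja_free_node_rcu(struct rcu_head *head)
{
        free(caa_container_of(head, struct ja_node, rcu_head));
}

/*
 * Insertion step D, reallocation case.  Called with the parent-parent
 * and parent locks held.  new_parent already contains the old parent's
 * children plus the new branch, and its child count has been
 * incremented.
 */
static void ja_insert_publish_realloc(struct ja_node *parent_parent,
                uint8_t idx, struct ja_node *old_parent,
                struct ja_node *new_parent)
{
        /* Publish the new parent in place of the old one. */
        rcu_assign_pointer(parent_parent->children[idx], new_parent);
        /* Make concurrent updaters re-lookup instead of using old_parent. */
        old_parent->gc_flag = 1;
        /* Reclaim old_parent after all pre-existing RCU readers finish. */
        call_rcu(&old_parent->rcu_head, ja_free_node_rcu);
}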
- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup the leaf to remove. If the leaf is missing, fail and release
the RCU read lock.

B) Lock

RCU-lookup the parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup the parent-parent and take the parent-parent lock.
Do so recursively until no node reallocation is needed, or until the
root is reached.

For each lock taken, check if the node is being gc'd. If gc'd, release
the lock, re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of a branch) covered by the taken locks
necessarily has, as the operation to perform on its topmost node, either
a simple node removal or a node update.

If the operation is a node removal, then, necessarily, the entire branch
under the node removal operation will simply disappear. No node
allocation is needed.

Otherwise, if the operation is a child node reallocation, the child node
necessarily performs a node removal, so _its_ entire child branch will
disappear. Reallocate this child node without the removed branch
(remembering to decrement its children count).

D) Populate

No-reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node (with rcu_assign_pointer).
Set the gc flag in the old nodes.
call_rcu free for all old nodes.

E) Locks

Release all locks.
rcu_read_unlock()


For the various types of nodes:

- sequential search (Type A)
  - RCU replacement: mutex
  - Entry update: mutex

- pools of value and pointer arrays (Type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon-hole array (Type C)
  - RCU replacement: mutex
  - Entry update: mutex
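Finally, the top-down leaf lookup described earlier can be summarized by
the following hedged sketch. ja_node_get_nth() is a hypothetical
per-type dispatch helper (linear scan for Type A, pool scan for Type B,
direct indexing for Type C) that is assumed to strip the 2 type bits and
to load child pointers with rcu_dereference() internally.

#include <stdint.h>
#include <urcu.h>

struct ja_node;         /* opaque (tagged) node */

/* Hypothetical dispatch: return the child for 8-bit index n, or NULL. */
extern struct ja_node *ja_node_get_nth(struct ja_node *node, uint8_t n);

static struct ja_node *ja_lookup(struct ja_node *root, unsigned long key)
{
        struct ja_node *node = root;
        int level;

        rcu_read_lock();
        /*
         * Consume the key 8 bits at a time, most-significant byte first:
         * 4 levels on 32-bit, 8 levels on 64-bit.
         */
        for (level = sizeof(key) - 1; level >= 0 && node; level--)
                node = ja_node_get_nth(node, (uint8_t) (key >> (level * 8)));
        rcu_read_unlock();
        /*
         * The caller must itself be within an RCU read-side critical
         * section (read locks nest) to safely use the returned leaf.
         */
        return node;    /* NULL means some level was not populated */
}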