RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using Judy Arrays (compressed nodes) for an RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than, e.g., non-RCU
  red-black trees.

Choice: use 256-entry intermediate nodes (an index can be represented
on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit.


* Node types (from least dense node to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal only needs to update the value and pointer arrays, as a
  single-entry change (non-RCU...). For RCU, we might need to update the
  entire node anyway.
- Requires a sequential search through the whole value array for the
  lookup failure test.

Filled at 3 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
 - lookup failure & success: 1 cache line.

Filled at 6 entries max (32-bit), 7 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4 bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
 - lookup failure & success: 1 cache line.

Filled at 12 entries max (32-bit), 14 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4 bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max (32-bit), 28 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in the values array fits in a 32-byte cache line.
 - lookup failure: 1 cache line.
 - lookup success: 2 cache lines.

The two configurations below are listed for completeness' sake, but
because they require 2 32-byte cache lines for the values lookup, they
are deemed inappropriate.

Filled at 51 entries max (32-bit), 56 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4 bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max (32-bit), 113 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4 bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes


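Below is a minimal sketch (in C, with hypothetical names; not the actual
rcuja implementation) of a Type A linear node and its sequential lookup,
sized for the largest retained 64-bit configuration (28 entries, 256
bytes):

#include <stdint.h>
#include <urcu.h>       /* rcu_dereference() */

#define JA_LINEAR_MAX   28      /* 64-bit: 1 + 28 + 3 pad + 28*8 = 256 bytes */

struct ja_linear_node {
        uint8_t nr_children;
        uint8_t values[JA_LINEAR_MAX];
        /* implicit padding up to pointer alignment */
        void *children[JA_LINEAR_MAX];
};

/*
 * Sequential lookup: scan the values array (count + values fit in a
 * single 32-byte cache line for up to 28 entries), then read the
 * associated pointer.  Returns NULL on lookup failure.
 */
static void *ja_linear_lookup(struct ja_linear_node *node, uint8_t key)
{
        unsigned int i, nr = node->nr_children;

        for (i = 0; i < nr; i++) {
                if (node->values[i] == key)
                        return rcu_dereference(node->children[i]);
        }
        return NULL;    /* failure: only the count/values cache line was read */
}

A failed lookup only reads the count/values area; once that area exceeds
32 bytes (the 51- and 102-entry configurations), even the failure path
needs two cache lines, which is why those configurations are rejected
above.
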
- Type B: pools of values and pointers arrays

Pools of values and pointers arrays. Each pool's values array is 32 bytes
in size (so it fits in an L1 cache line). Each pool begins with an 8-bit
integer, which is the number of children in this pool, followed by an
array of 8-bit values, padding, and an array of pointers. Values and
pointer arrays are associated as in Type A.

The entries of a node are associated with their respective pool based
on their index position.

+ Allows lookup failure to touch only one 32-byte cache line.
  Lookup success: 2 cache lines.

+ Allows in-place updates without reallocation, except when a pool is
  full. (This was not possible with bitmap-based nodes.)
- If one pool exhausts its space, we need to increase the node size.
  Therefore, for very dense populations, we will end up using the
  pigeon-hole node type sooner, thus consuming more space.

Pool configuration:

Per pool, filled at 25 entries (32-bit), 28 entries (64-bit)
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

Total up to 50 entries (32-bit), 56 entries (64-bit)
2 pools: 32-bit = 256 bytes
2 pools: 64-bit = 512 bytes

Total up to 100 entries (32-bit), 112 entries (64-bit)
4 pools: 32-bit = 512 bytes
4 pools: 64-bit = 1024 bytes


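As an illustration, here is a minimal C sketch (hypothetical names) of a
two-pool Type B node and its lookup. For simplicity the pool-selection
bit is stored in the node itself; the section below proposes encoding it
in spare low-order bits of the parent pointer instead.

#include <stdint.h>
#include <urcu.h>

#define JA_POOL_MAX     28      /* per pool, 64-bit: 256 bytes */

/* Per-pool linear array: same layout as a Type A node. */
struct ja_pool {
        uint8_t nr_children;
        uint8_t values[JA_POOL_MAX];
        void *children[JA_POOL_MAX];
};

/* Hypothetical two-pool Type B node. */
struct ja_pool_node {
        unsigned int distrib_bit;       /* index bit selecting the pool */
        struct ja_pool pools[2];
};

/*
 * Route the 8-bit index to its pool using the distribution bit, then
 * perform a Type A style sequential lookup inside that pool.
 */
static void *ja_pool_lookup(struct ja_pool_node *node, uint8_t key)
{
        struct ja_pool *pool = &node->pools[(key >> node->distrib_bit) & 1];
        unsigned int i, nr = pool->nr_children;

        for (i = 0; i < nr; i++) {
                if (pool->values[i] == key)
                        return rcu_dereference(pool->children[i]);
        }
        return NULL;
}
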
* Choice of pool configuration distribution:

We have pools of either 2 or 4 linear arrays. Their total size is
between 256 bytes (32-bit, 2 arrays) and 1024 bytes (64-bit, 4 arrays).

Alignment on 256 bytes means that we can spare the 8 least-significant
bits of the pointers. Given that the type selection already uses 3 bits,
we have 7 bits left.

Alignment on 512 bytes -> 8 bits left.

We can therefore encode which bit, or which two bits, are used as the
distribution selection. We can use this technique to rebalance pools if
they become unbalanced (e.g. all children end up within one of the two
linear arrays).

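A small C sketch (hypothetical helper, not part of rcuja) of how the
distribution bit minimizing the unbalance between the two sub-pools
could be chosen for a given set of child indices:

#include <stdint.h>

/*
 * Given the 8-bit indices of the children currently present, choose the
 * distribution bit that splits them most evenly between the two pools.
 * Returns the bit number (0..7); *max_pool is set to the size of the
 * larger resulting pool, so the caller can check it against the pool
 * capacity (25 entries on 32-bit, 28 on 64-bit).
 */
static unsigned int ja_best_distrib_bit(const uint8_t *indices,
                unsigned int nr, unsigned int *max_pool)
{
        unsigned int bit, best_bit = 0, best_max = nr + 1;

        for (bit = 0; bit < 8; bit++) {
                unsigned int i, ones = 0, zeroes, max;

                for (i = 0; i < nr; i++)
                        ones += (indices[i] >> bit) & 1;
                zeroes = nr - ones;
                max = ones > zeroes ? ones : zeroes;
                if (max < best_max) {
                        best_max = max;
                        best_bit = bit;
                }
        }
        *max_pool = best_max;
        return best_bit;
}
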
We assume that finding the exact sub-pool usage maximum for any given
distribution is NP-complete (not proven).

We take sub-class size unbalance into account. This was tested
programmatically by randomly taking N entries out of 256, calculating
the distribution for each bit (number of entries for which the bit is
one/zero), computing the difference between those counts for each bit,
and choosing the bit with the minimum difference -- over millions of
runs.

We start from the "ideal" population size (with no unbalance), and
iterate to a fixed point to find the appropriate population size.

tot entries    subclass extra items    largest linear array (stat. approx.)
---------------------------------------------------------------------------
48 entries:    1 (98%)                 24+1=25 (target ~50/2=25)
54 entries:    1 (97%)                 27+1=28 (target ~56/2=28)

Note: rare worse cases exist where the unbalance is larger, but they
happen _very_ rarely. We still need to provide a fallback for when the
subclass does not fit, but that fallback does not need to be efficient.


For pools of size 4, we need to approximate the maximum unbalance we can
get when distributions are chosen by pairs of bits.

tot entries    subclass extra items    largest linear array (stat. approx.)
---------------------------------------------------------------------------
92 entries:    2 (99%)                 23+2=25 (target: ~100/4=25)
104 entries:   2 (99%)                 26+2=28 (target: ~112/4=28)


The same note applies as above: rare larger-unbalance cases exist, and a
fallback (which does not need to be efficient) must handle them.


* Population "does not fit" and distribution fallback

When adding a child to a distribution node, if the child does not fit,
we recalculate the best distribution. If it does not fit in that
distribution either, we need to expand the node type.

When removing a child, if the node child count is brought down to the
number of entries expected to statistically fit in the lower-order node,
we try to shrink. However, if we notice that the distribution does not
actually fit in that shrunk node, we abort the shrink operation. If the
shrink fails, we keep a counter of insertion/removal operations on the
node before we allow the shrink to be attempted again.


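A sketch, under assumed field and constant names, of the shrink
hysteresis described above: after a failed shrink attempt, we wait for a
number of add/removal operations before trying again.

#include <stdbool.h>

#define JA_SHRINK_RETRY_PERIOD  32      /* arbitrary value for this sketch */

/* Hypothetical per-node bookkeeping. */
struct ja_node_state {
        unsigned int nr_children;
        unsigned int shrink_backoff;    /* ops to wait before retrying a shrink */
};

/* Called under the node lock on every add/removal. */
static bool ja_should_try_shrink(struct ja_node_state *state,
                unsigned int shrink_threshold)
{
        if (state->nr_children > shrink_threshold)
                return false;   /* still too many children for the smaller type */
        if (state->shrink_backoff) {
                state->shrink_backoff--;        /* previous shrink attempt failed */
                return false;
        }
        return true;
}

/* Called when the recalculated distribution does not fit the smaller node. */
static void ja_shrink_failed(struct ja_node_state *state)
{
        state->shrink_backoff = JA_SHRINK_RETRY_PERIOD;
}
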
- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121 entries and up, 64-bit: 125
entries and up).
Array of child node pointers. A pointer is NULL if there is no child at
that index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes


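A minimal C sketch (hypothetical names) of the pigeon-hole node and its
direct-indexed lookup:

#include <stdint.h>
#include <urcu.h>

/* One pointer per possible 8-bit index. */
struct ja_pigeon_node {
        void *children[256];    /* NULL when there is no child at that index */
};

static void *ja_pigeon_lookup(struct ja_pigeon_node *node, uint8_t key)
{
        /* Single load; one cache line touched for both failure and success. */
        return rcu_dereference(node->children[key]);
}
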
* Analysis of the thresholds:

Analysis of the number of cache lines touched per node, for each node
type, as the number of children per node is incremented from 0 to 256.
From this, we choose the number-of-children thresholds at which it
becomes worthwhile to switch to a different node type.

Per node:

- ALWAYS 1 cache-line hit for lookup failure (all cases)

32-bit

- Nonexistent

  0 children

- Type A: sequential search in value and pointer arrays
  - 1 cache-line hit for lookup success
  - 32 bytes storage

  up to 6 children

  - 2 cache-line hits for lookup success
  - 64 bytes storage

  up to 12 children

  - 128 bytes storage

  up to 25 children

- Type B: pool

  - 256 bytes storage

  up to 50 children

  - 512 bytes storage

  up to 100 children

- Type C: pigeon-hole array
  - 1 cache-line hit for lookup success
  - 1024 bytes storage

  up to 256 children


64-bit

- Nonexistent

  0 children

- Type A: sequential search in value and pointer arrays
  - 1 cache-line hit for lookup success
  - 32 bytes storage

  up to 3 children

  - 2 cache-line hits for lookup success
  - 64 bytes storage

  up to 7 children

  - 128 bytes storage

  up to 14 children

  - 256 bytes storage

  up to 28 children

- Type B: pool

  - 512 bytes storage

  up to 56 children

  - 1024 bytes storage

  up to 112 children

- Type C: pigeon-hole array
  - 1 cache-line hit for lookup success
  - 2048 bytes storage

  up to 256 children


* Analysis of node type encoding and node pointers:

Lookups are _always_ performed from the top of the tree going down. This
facilitates RCU replacement, as we only keep track of pointers going
downward.

The type of a node is encoded in the parent's pointer to it. We need to
reserve the 2 least-significant bits for this.

Types of children:

enum child_type {
        RCU_JA_LINEAR = 0,      /* Type A */
                        /* 32-bit: 1 to 25 children, 8 to 128 bytes */
                        /* 64-bit: 1 to 28 children, 16 to 256 bytes */
        RCU_JA_POOL = 1,        /* Type B */
                        /* 32-bit: 26 to 100 children, 256 to 512 bytes */
                        /* 64-bit: 29 to 112 children, 512 to 1024 bytes */
        RCU_JA_PIGEON = 2,      /* Type C */
                        /* 32-bit: 101 to 256 children, 1024 bytes */
                        /* 64-bit: 113 to 256 children, 2048 bytes */
        /* Leaf nodes are implicit from their height in the tree. */
};

If the entire pointer is NULL, the child is empty.


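A minimal C sketch (hypothetical helper names) of encoding and decoding
the child type in the 2 least-significant bits of the parent's pointer,
assuming node allocations aligned on at least 4 bytes:

#include <stdint.h>

enum child_type {
        RCU_JA_LINEAR = 0,      /* Type A */
        RCU_JA_POOL = 1,        /* Type B */
        RCU_JA_PIGEON = 2,      /* Type C */
};

#define JA_TYPE_BITS    2
#define JA_TYPE_MASK    ((uintptr_t) ((1 << JA_TYPE_BITS) - 1))

/* Encode the child type into the pointer stored in the parent. */
static void *ja_node_encode(void *child, enum child_type type)
{
        return (void *) ((uintptr_t) child | (uintptr_t) type);
}

/* Extract the child type from an encoded pointer. */
static enum child_type ja_node_type(void *ptr)
{
        return (enum child_type) ((uintptr_t) ptr & JA_TYPE_MASK);
}

/* Strip the type bits to recover the actual child node address. */
static void *ja_node_ptr(void *ptr)
{
        return (void *) ((uintptr_t) ptr & ~JA_TYPE_MASK);
}
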
* Lookup and Update Algorithms

Let's propose a fairly simple scheme that uses a mutex on nodes to
manage update concurrency. It is certainly not optimal in terms of
concurrency management within a node, but it has the advantage of being
simple to implement and understand.

We need to keep a count of the number of children for each node, to keep
track of when the node type thresholds are reached. It is important to
add hysteresis so that we do not switch between node types too often
when the same child is repeatedly added and removed.

We acquire locks from child to parent, nested. We take all the locks
required to perform a given update in the tree (but no more), to keep it
consistent with respect to the number of children per node.

If the check for a node being gc'd (always done under the node lock)
fails, we simply release the lock and look up the node again.


- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree until we reach the leaf node. If a
level is not populated, fail.

rcu_read_unlock()

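A C sketch of this walk (hypothetical names; ja_node_child() stands in
for the per-node-type lookups sketched earlier and is not part of the
actual rcuja API):

#include <stdint.h>
#include <urcu.h>

/* Placeholder for the per-node-type child lookup (Type A/B/C dispatch). */
static void *ja_node_child(void *node, uint8_t key_byte)
{
        (void) node;
        (void) key_byte;
        return NULL;
}

/*
 * Walk from the root down to the leaf, consuming 8 bits of the key per
 * level (8 levels for a 64-bit key).  The caller must hold the RCU
 * read-side lock across the lookup and the use of the returned leaf.
 */
static void *ja_leaf_lookup(void **rootp, uint64_t key)
{
        void *node = rcu_dereference(*rootp);
        int level;

        for (level = (int) sizeof(key) - 1; level >= 0 && node; level--) {
                uint8_t key_byte = (uint8_t) (key >> (level * 8));

                node = ja_node_child(node, key_byte);
        }
        return node;    /* NULL if some level was not populated */
}

Typical use: rcu_read_lock(); look up the leaf; use it; then
rcu_read_unlock(), matching the pseudo-code above.
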
- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup the insert position. Find the location in the tree where
nodes are missing for this insertion. If the leaf is already present,
the insert fails, releasing the RCU read lock. The insert location
consists of a parent node to which we want to attach a new node.

B) Lock

RCU-lookup the parent node. Take the parent lock. If the parent needs to
be reallocated to make room for this insertion, RCU-lookup the
parent-parent node and take the parent-parent lock. For each lock taken,
check if the node is being gc'd. If gc'd, release the lock,
re-RCU-lookup this node, and retry.

C) Create

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting.

D) Populate:
- If parent node reallocation is needed:
    Reallocate the parent node, adding the new branch to it, and
    increment its node count.
    Set the gc flag in the old nodes.
    call_rcu free for all old nodes.
    Publish the new parent node with rcu_assign_pointer.
- Else:
    Increment the parent node count.
    Use rcu_assign_pointer to populate this new branch into the parent
    node.

E) Locks

Release the parent and (if taken) parent-parent locks.
rcu_read_unlock()


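A C sketch (hypothetical node header and helper) of the "take lock,
check gc flag, retry" step used in B) above; the gc flag is only read
and set under the node lock:

#include <stdbool.h>
#include <pthread.h>

/* Hypothetical node header shared by all node types. */
struct ja_inode {
        pthread_mutex_t lock;
        bool gc_flag;           /* set (under the lock) once the node is replaced */
        unsigned int nr_children;
};

/*
 * Lock a node found by RCU lookup.  If the node was concurrently
 * replaced (gc flag set), unlock and report failure so the caller
 * re-does the RCU lookup and retries.
 */
static bool ja_lock_node(struct ja_inode *node)
{
        pthread_mutex_lock(&node->lock);
        if (!node->gc_flag)
                return true;    /* locked and still current */
        pthread_mutex_unlock(&node->lock);
        return false;
}
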
- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup the leaf to remove. If the leaf is missing, fail and release
the RCU read lock.

B) Lock

RCU-lookup the parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup the parent-parent and take the parent-parent lock.
Do so recursively until no node reallocation is needed, or until the
root is reached.

For each lock taken, check if the node is being gc'd. If gc'd, release
the lock, re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of a branch) covered by the taken locks
necessarily has a simple node removal or update to perform on its top
node.

If the operation is a node removal, then, necessarily, the entire branch
under the removed node simply disappears. No node allocation is needed.

Else, if the operation is a child node reallocation, the child node will
necessarily perform a node removal, so _its_ entire child branch will
disappear. Reallocate this child node without the removed branch
(remember to decrement its nr-children count).

D) Populate

No-reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its nr-children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set the gc flag in the old nodes.
call_rcu free for all old nodes.

E) Locks

Release all locks.
rcu_read_unlock()


For the various types of nodes:

- sequential search (Type A)
  - RCU replacement: mutex
  - Entry update: mutex

- pools of values and pointers arrays (Type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon-hole array (Type C)
  - RCU replacement: mutex
  - Entry update: mutex