RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using a Judy Array (compressed nodes) for an RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than, e.g., non-RCU
  red-black trees.

Choice: use 256-entry intermediate nodes (an index can be represented
on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit.


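With 256-entry intermediate nodes, each tree level consumes one byte of
the key. As a sketch (the helper name and level ordering are
assumptions for illustration, not the rcuja API), the index used at a
given level of a 64-bit tree can be extracted like this:

```c
#include <stdint.h>

/* Hypothetical helper: extract the 8-bit index used at a given tree
 * level for a 64-bit key; level 0 is the topmost of the 8 levels. */
static inline uint8_t ja_level_index(uint64_t key, unsigned int level)
{
	return (uint8_t)(key >> ((7 - level) * 8));
}
```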
* Node types (from least dense to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal just needs to update the value and pointer arrays,
  single-entry (non-RCU...). For RCU, we might need to update the
  entire node anyway.
- Requires a sequential search through the whole value array to detect
  lookup failure.

Filled at 3 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit
in a 32-byte cache line.
- lookup fail & success: 1 cache line.

Filled at 6 entries max (32-bit), 7 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4 = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit
in a 32-byte cache line.
- lookup fail & success: 1 cache line.

Filled at 12 entries max (32-bit), 14 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4 = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max (32-bit), 28 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential
lookup in the values array fits in a 32-byte cache line.
- lookup failure: 1 cache line.
- lookup success: 2 cache lines.

The two sizes below are listed for completeness' sake, but because
they require two 32-byte cache lines for lookup, they are deemed
inappropriate.

Filled at 51 entries max (32-bit), 56 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4 = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max (32-bit), 113 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4 = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes


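The Type A layout described above can be sketched as follows. This is
a minimal illustration, not the actual rcuja structures: the names,
the fixed maximum, and letting the compiler insert the padding are all
assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define JA_LINEAR_MAX 28	/* 64-bit: largest Type A node */

struct ja_linear_node {
	uint8_t nr_children;		/* 8-bit child count */
	uint8_t values[JA_LINEAR_MAX];	/* packed 8-bit values */
	/* compiler pads here so the pointers are aligned */
	void *pointers[JA_LINEAR_MAX];	/* associated child pointers */
};

/* Sequential search: scan the packed value array, then fetch the
 * pointer at the matching position. Lookup failure only touches the
 * value array, which is why it stays within one cache line for the
 * small node sizes. */
static void *ja_linear_lookup(const struct ja_linear_node *node, uint8_t v)
{
	unsigned int i;

	for (i = 0; i < node->nr_children; i++) {
		if (node->values[i] == v)
			return node->pointers[i];
	}
	return NULL;
}
```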
- Type B: pools of value and pointer arrays

Pools of value and pointer arrays. Each pool's values array is 32
bytes in size (so it fits in an L1 cache line). Each pool begins with
an 8-bit integer, which is the number of children in this pool,
followed by an array of 8-bit values, padding, and an array of
pointers. Value and pointer arrays are associated as in Type A.

The entries of a node are associated with their respective pool based
on their index position.

+ Allows lookup failure to touch only one 32-byte cache line.
  Lookup success: 2 cache lines.
+ Allows in-place updates without reallocation, except when a pool is
  full. (This was not possible with bitmap-based nodes.)
- If one pool exhausts its space, we need to increase the node size.
  Therefore, for very dense populations, we will end up using the
  pigeon-hole node type sooner, thus consuming more space.

Pool configuration:

Per pool, filled at 25 entries (32-bit), 28 entries (64-bit)
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

Total up to 50 entries (32-bit), 56 entries (64-bit)
2 pools: 32-bit = 256 bytes
2 pools: 64-bit = 512 bytes

Total up to 100 entries (32-bit), 112 entries (64-bit)
4 pools: 32-bit = 512 bytes
4 pools: 64-bit = 1024 bytes


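Since entries map to pools by index position, with 2 or 4 pools a
lookup can pick its pool from the high-order bits of the 8-bit index
and then only read that one pool. This is a sketch under that
assumption (the helper name and the exact bit split are not from the
source):

```c
#include <stdint.h>

/* Hypothetical pool selector: split the 8-bit index space evenly
 * across nr_pools pools (nr_pools is a power of two, 2 or 4 here), so
 * each lookup touches a single pool's values array. */
static inline unsigned int ja_pool_nr(uint8_t v, unsigned int nr_pools)
{
	return v / (256 / nr_pools);
}
```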
- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121+ entries, 64-bit: 125+ entries)
Array of child node pointers. A pointer is NULL if there is no child
at that index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes


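A Type C node is a direct-mapped array, so a lookup is a single
indexed load. A minimal sketch (names are illustrative, not the rcuja
structures):

```c
#include <stddef.h>
#include <stdint.h>

/* Pigeon-hole node: one pointer slot per possible 8-bit index,
 * NULL meaning "no child". 64-bit: 8 * 256 = 2048 bytes. */
struct ja_pigeon_node {
	void *slots[256];
};

/* Both lookup failure and success cost a single cache-line hit:
 * index directly into the slot array. */
static inline void *ja_pigeon_lookup(const struct ja_pigeon_node *node,
		uint8_t v)
{
	return node->slots[v];
}
```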
* Analysis of the thresholds:

Analysis of the number of cache lines touched for each node, per node
type, depending on the number of children per node, as we increment
the number of children from 0 to 256. Through this, we choose the
number-of-children thresholds at which it is worthwhile to switch to a
different node type.

Per node:

- ALWAYS 1 cache-line hit for lookup failure (all cases)

32-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache-line hit for lookup success
- 32 bytes storage

up to 6 children

- 2 cache-line hits for lookup success
- 64 bytes storage

up to 12 children

- 128 bytes storage

up to 25 children

- Type B: pool

- 256 bytes storage

up to 50 children

- 512 bytes storage
up to 100 children

- Type C: pigeon-hole array
- 1 cache-line hit for lookup success
- 1024 bytes storage

up to 256 children


64-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache-line hit for lookup success
- 32 bytes storage

up to 3 children

- 2 cache-line hits for lookup success
- 64 bytes storage

up to 7 children

- 128 bytes storage

up to 14 children

- 256 bytes storage

up to 28 children

- Type B: pool

- 512 bytes storage
up to 56 children

- 1024 bytes storage
up to 112 children

- Type C: pigeon-hole array
- 1 cache-line hit for lookup success
- 2048 bytes storage

up to 256 children


* Analysis of node type encoding and node pointers:

Lookups are _always_ performed from the top of the tree going down.
This facilitates RCU replacement, as we only keep track of pointers
going downward.

The type of a node is encoded in the parent's pointer. We need to
reserve the 2 least-significant bits.

Types of children:

enum child_type {
	RCU_JA_LINEAR = 0,	/* Type A */
			/* 32-bit: 1 to 25 children, 8 to 128 bytes */
			/* 64-bit: 1 to 28 children, 16 to 256 bytes */
	RCU_JA_POOL = 1,	/* Type B */
			/* 32-bit: 26 to 100 children, 256 to 512 bytes */
			/* 64-bit: 29 to 112 children, 512 to 1024 bytes */
	RCU_JA_PIGEON = 2,	/* Type C */
			/* 32-bit: 101 to 256 children, 1024 bytes */
			/* 64-bit: 113 to 256 children, 2048 bytes */
	/* Leaf nodes are implicit from their height in the tree */
};

If the entire pointer is NULL, the child is empty.


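Reserving the 2 least-significant bits implies node allocations must
be at least 4-byte aligned. A sketch of the tagging helpers (names are
illustrative, not the rcuja API):

```c
#include <stdint.h>

enum child_type {
	RCU_JA_LINEAR = 0,	/* Type A */
	RCU_JA_POOL = 1,	/* Type B */
	RCU_JA_PIGEON = 2,	/* Type C */
};

/* Strip the type tag to recover the node address. */
static inline void *ja_node_ptr(void *tagged)
{
	return (void *)((uintptr_t)tagged & ~(uintptr_t)3);
}

/* Read the node type from the 2 least-significant bits. */
static inline enum child_type ja_node_type(void *tagged)
{
	return (enum child_type)((uintptr_t)tagged & 3);
}

/* Combine a (4-byte-aligned) node address with its type. */
static inline void *ja_node_tag(void *node, enum child_type type)
{
	return (void *)((uintptr_t)node | (uintptr_t)type);
}
```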
* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to
manage update concurrency. It is certainly not optimal in terms of
concurrency management within a node, but it has the advantage of
being simple to implement and understand.

We need to keep a count of the number of children per node, to keep
track of when the node type thresholds are reached. It is important to
add hysteresis so we don't switch between node types too often when a
loop repeatedly adds and removes the same node.

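The hysteresis idea can be sketched as follows: grow to the next node
type above one threshold, but only shrink back below a strictly lower
one, so an add/remove loop at the boundary does not reallocate on
every operation. The threshold values and names here are assumptions
for illustration, not the tuned rcuja constants.

```c
/* Hypothetical thresholds for the 64-bit linear <-> pool transition:
 * grow above 28 children, but only shrink back below 25, leaving a
 * hysteresis band where neither conversion triggers. */
#define JA_POOL_GROW_THRESHOLD		28
#define JA_POOL_SHRINK_THRESHOLD	25

static inline int ja_should_grow(unsigned int nr_children)
{
	return nr_children > JA_POOL_GROW_THRESHOLD;
}

static inline int ja_should_shrink(unsigned int nr_children)
{
	return nr_children < JA_POOL_SHRINK_THRESHOLD;
}
```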
We acquire locks from child to parent, nested. We take all the locks
required to perform a given update in the tree (but no more) to keep
it consistent with respect to the number of children per node.

If the check for a node being gc'd (always done with the node lock
held) detects the node is being gc'd, we simply release the lock and
look up the node again.

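The lock-then-check-gc pattern used below in insertion and removal can
be sketched like this (names are illustrative; in the real tree the
caller would redo the RCU lookup before retrying):

```c
#include <pthread.h>

/* Minimal stand-in for a lockable tree node with a gc flag that is
 * set once the node has been scheduled for reclamation. */
struct ja_locked_node {
	pthread_mutex_t lock;
	int gc_flag;
};

/* Returns 0 with the lock held, or -1 if the node is being gc'd and
 * the caller must re-lookup the node and retry. */
static int ja_lock_node(struct ja_locked_node *node)
{
	pthread_mutex_lock(&node->lock);
	if (node->gc_flag) {
		pthread_mutex_unlock(&node->lock);
		return -1;
	}
	return 0;
}
```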
- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree until we reach the leaf node. If a
level is not populated, fail.

rcu_read_unlock()

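The top-down walk can be sketched as follows. In the real tree each
child fetch would go through rcu_dereference() between
rcu_read_lock()/rcu_read_unlock(), and the node type would be decoded
at each level; plain loads and a single pigeon-hole-style node type
are used here so the sketch stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>

struct ja_node {
	struct ja_node *slots[256];	/* one slot per 8-bit index */
};

#define JA_LEVELS_64	8	/* one 8-bit index per level on 64-bit */

static void *ja_lookup(struct ja_node *root, uint64_t key)
{
	struct ja_node *node = root;	/* rcu_dereference() in real code */
	unsigned int level;

	for (level = 0; level < JA_LEVELS_64; level++) {
		uint8_t idx = (uint8_t)(key >> ((JA_LEVELS_64 - 1 - level) * 8));

		if (!node)
			return NULL;	/* level not populated: fail */
		node = node->slots[idx];	/* rcu_dereference() in real code */
	}
	return node;	/* leaf pointer, or NULL if absent */
}
```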
- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup the insert position: find the location in the tree where
nodes are missing for this insertion. If the leaf is already present,
the insert fails, releasing the RCU read lock. The insert location
consists of a parent node to which we want to attach a new node.

B) Lock

RCU-lookup the parent node. Take the parent lock. If the parent needs
to be reallocated to make room for this insertion, RCU-lookup the
parent-parent node and take the parent-parent lock. For each lock
taken, check if the node is being gc'd. If gc'd, release the lock,
re-RCU-lookup this node, and retry.

C) Create

Construct the whole branch, from the new topmost intermediate node
down to the new leaf node we are inserting.

D) Populate

- If parent node reallocation is needed:
  Reallocate the parent node, adding the new branch to it, and
  increment its node count.
  Set the gc flag in the old nodes.
  call_rcu free for all old nodes.
  Publish the new parent node with rcu_assign_pointer.
- Else:
  Increment the parent node count.
  Use rcu_assign_pointer to populate this new branch into the parent
  node.

E) Unlock

Release the parent and (if taken) parent-parent locks.
rcu_read_unlock()

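The reallocation branch of step D can be sketched as follows. The urcu
primitives are shown in comments and replaced with a compiler builtin
stand-in so the sketch compiles without liburcu; all structure and
function names are illustrative, not the rcuja implementation.

```c
#include <stddef.h>

struct ja_fake_node {
	unsigned int nr_children;
	int gc_flag;	/* set on nodes pending reclamation */
	void *child;	/* single child slot, for the sketch */
};

/* Reallocate-and-publish: fill the new parent, mark the old one as
 * gc'd (so concurrent lockers retry), defer its free, then publish
 * the new parent with a release store. */
static void ja_populate_realloc(struct ja_fake_node **parent_slot,
		struct ja_fake_node *old_parent,
		struct ja_fake_node *new_parent,
		void *new_branch)
{
	new_parent->child = new_branch;
	new_parent->nr_children = old_parent->nr_children + 1;

	old_parent->gc_flag = 1;
	/* call_rcu(&old_parent->rcu_head, free_node); in real code */

	/* rcu_assign_pointer(*parent_slot, new_parent); in real code: */
	__atomic_store_n(parent_slot, new_parent, __ATOMIC_RELEASE);
}
```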
- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup the leaf to remove. If the leaf is missing, fail and
release the RCU read lock.

B) Lock

RCU-lookup the parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number
of children, RCU-lookup the parent-parent and take the parent-parent
lock. Do so recursively until no node reallocation is needed, or until
the root is reached.

For each lock taken, check if the node is being gc'd. If gc'd, release
the lock, re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of a branch) covered by the taken locks
necessarily has a simple node removal or update to perform on its top
node.

If the operation is a node removal, then, necessarily, the entire
branch under the removed node simply disappears. No node allocation is
needed.

Else, if the operation is a child node reallocation, the child node
necessarily performs a node removal, so _its_ entire child branch will
disappear. Reallocate this child node without the removed branch
(remember to decrement its children count).

D) Populate

No-reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set the old nodes' gc flag.
call_rcu free for all old nodes.

E) Unlock

Release all locks.
rcu_read_unlock()

For the various types of nodes:

- sequential search (Type A)
- RCU replacement: mutex
- Entry update: mutex

- pools of value and pointer arrays (Type B)
- RCU replacement: mutex
- Entry update: mutex

- pigeon-hole array (Type C)
- RCU replacement: mutex
- Entry update: mutex