RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and an RCU-specific analysis are introduced in this document.

Advantages of using a Judy Array (compressed nodes) for an RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than, e.g., in non-RCU
  red-black trees.

Choice: use 256-entry intermediate nodes (an index can be represented on
8 bits): 4 levels on 32-bit, 8 levels on 64-bit.

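With 256-entry nodes, each level of the tree consumes 8 bits of the key.
As a minimal sketch (assuming 64-bit keys and a most-significant-byte-first
traversal order, which is an illustration choice, not something fixed by
this document):

#include <stdint.h>

#define JA_LEVELS       8       /* 64-bit: 8 levels of 8-bit indices */

/* Return the 8-bit index used at 'level' (0 = topmost) for 'key'. */
static inline uint8_t ja_level_index(uint64_t key, unsigned int level)
{
        return (uint8_t) (key >> ((JA_LEVELS - 1 - level) * 8));
}
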
* Node types (from least dense node to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal just needs to update the value and pointer arrays,
  single-entry (non-RCU...). For RCU, we might need to update the entire
  node anyway.
- Requires a sequential search through the whole value array for the
  lookup-failure test.

Filled at 3 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail & success: 1 cache line.

Filled at 6 entries max 32-bit, 7 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4 bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail & success: 1 cache line.

Filled at 12 entries max 32-bit, 14 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4 bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max 32-bit, 28 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in the values array fits in a 32-byte cache line.
  - lookup failure: 1 cache line.
  - lookup success: 2 cache lines.

The two configurations below are listed for completeness' sake, but
because they require two 32-byte cache lines for lookup, they are deemed
inappropriate.

Filled at 51 entries max 32-bit, 56 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4 bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max 32-bit, 113 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4 bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes

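For illustration, here is a minimal sketch of a Type A node and its
sequential lookup, using the 7-entry / 64-byte 64-bit configuration above.
All names are made up and the actual layout may differ; rcu_dereference()
is the liburcu read-side primitive:

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

/* Hypothetical Type A (linear) node: count, value array, pointer array. */
struct ja_node_linear {
        uint8_t nr_children;    /* 8 bits indicating number of children */
        uint8_t values[7];      /* 8-bit values of populated children */
        void *children[7];      /* associated child pointers */
};

/* Sequential search: return the child for 8-bit index 'n', NULL if absent. */
static void *ja_linear_lookup(struct ja_node_linear *node, uint8_t n)
{
        unsigned int i;

        for (i = 0; i < node->nr_children; i++) {
                if (node->values[i] == n)
                        return rcu_dereference(node->children[i]);
        }
        return NULL;    /* lookup failure: only the values array was read */
}
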
- Type B: bitmap, followed by pointers array.

bitmask (256 entries -> 256 bits -> 32 bytes) of populated children,
followed by an array of children pointers, in the same order as they
appear in the bitmask.

+ Allows lookup failure to use a 32-byte cache line only. (1 cache line)
+ Allows lookup success to use the 32-byte cache line (bitmap),
  followed by a direct lookup into the pointer array. (2 cache lines)

Filled at 8 entries 32-bit, 12 entries 64-bit
32-bit: 32 + (4*8) -> 64 bytes
64-bit: 32 + (8*12) -> 128 bytes

Filled at 24 entries 32-bit, 28 entries 64-bit
32-bit: 32 + (4*24) -> 128 bytes
64-bit: 32 + (8*28) -> 256 bytes

Filled at 56 entries 32-bit, 60 entries 64-bit
32-bit: 32 + (4*56) -> 256 bytes
64-bit: 32 + (8*60) -> 512 bytes

Filled at 120 entries 32-bit, 124 entries 64-bit
32-bit: 32 + (4*120) -> 512 bytes
64-bit: 32 + (8*124) -> 1024 bytes

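For illustration, a minimal sketch of a Type B node and its lookup: the
bitmap test alone resolves lookup failures, and a population count of the
bits below the index gives the offset into the packed pointer array.
Names are made up, and GCC popcount builtins are assumed:

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

/* Hypothetical Type B node: 256-bit bitmap followed by packed pointers. */
struct ja_node_bitmap {
        uint64_t bitmap[4];     /* 256 bits = 32 bytes */
        void *children[];       /* one pointer per bit set, in bit order */
};

static void *ja_bitmap_lookup(struct ja_node_bitmap *node, uint8_t n)
{
        unsigned int word = n >> 6, bit = n & 63;
        unsigned int i, pos = 0;

        if (!(node->bitmap[word] & (1ULL << bit)))
                return NULL;    /* lookup failure: bitmap cache line only */
        /* Count populated children before index n to find the array offset. */
        for (i = 0; i < word; i++)
                pos += __builtin_popcountll(node->bitmap[i]);
        pos += __builtin_popcountll(node->bitmap[word] & ((1ULL << bit) - 1));
        return rcu_dereference(node->children[pos]);
}
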
- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121+ entries, 64-bit: 125+ entries)
Array of child-node pointers. Pointer is NULL if there is no child at
that index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes

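A Type C lookup is a single array indexation (sketch, made-up names):

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

/* Hypothetical Type C node: direct-indexed array of 256 child pointers. */
struct ja_node_pigeon {
        void *children[256];    /* NULL when there is no child at that index */
};

static void *ja_pigeon_lookup(struct ja_node_pigeon *node, uint8_t n)
{
        return rcu_dereference(node->children[n]);
}
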
* Analysis of the thresholds:

Analysis of the number of cache lines touched for each node, per node
type, depending on the number of children per node, as we increment the
number of children from 0 to 256. Through this, we choose the
number-of-children thresholds at which it becomes worthwhile to use a
different node type.

Per node:

- ALWAYS 1 cache-line hit for lookup failure (all cases)

32-bit

- Nonexistent

  0 children

- Type A: sequential search in value and pointer arrays
  - 1 cache-line hit for lookup success
  - 32 bytes storage

  up to 6 children

  - 2 cache-line hits for lookup success
  - 64 bytes storage

  up to 12 children

- Type B: bitmap, followed by pointers array.
  - 2 cache-line hits for lookup success
  - 128 bytes storage

  up to 24 children

  - 256 bytes storage
  up to 56 children

  - 512 bytes storage
  up to 120 children

- Type C: pigeon-hole array
  - 1 cache-line hit for lookup success
  - 1024 bytes storage

  up to 256 children


64-bit

- Nonexistent

  0 children

- Type A: sequential search in value and pointer arrays
  - 1 cache-line hit for lookup success
  - 32 bytes storage

  up to 3 children

  - 2 cache-line hits for lookup success
  - 64 bytes storage

  up to 7 children

  - 128 bytes storage

  up to 14 children

- Type B: bitmap, followed by pointers array.
  - 2 cache-line hits for lookup success
  - 256 bytes storage

  up to 28 children

  - 512 bytes storage
  up to 60 children

  - 1024 bytes storage
  up to 124 children

- Type C: pigeon-hole array
  - 1 cache-line hit for lookup success
  - 2048 bytes storage

  up to 256 children


* Analysis of node type encoding and node pointers:

Lookups are _always_ performed from the top of the tree going down. This
facilitates RCU replacement, as we only keep track of pointers going
downward.

The type of a node is encoded in the parent's pointer to it. We need to
reserve the 2 least-significant bits.

Types of children:

enum child_type {
        LINEAR = 0,     /* Type A */
                        /* 32-bit: 1 to 12 children, 32 to 64 bytes */
                        /* 64-bit: 1 to 14 children, 32 to 128 bytes */
        BITMAP = 1,     /* Type B */
                        /* 32-bit: 13 to 120 children, 128 to 512 bytes */
                        /* 64-bit: 15 to 124 children, 256 to 1024 bytes */
        PIGEON = 2,     /* Type C */
                        /* 32-bit: 121 to 256 children, 1024 bytes */
                        /* 64-bit: 125 to 256 children, 2048 bytes */
        LEAF = 3,       /* Leaf node */
};

If the entire pointer is NULL, the child is empty.

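A sketch of how the 2 least-significant bits of a child pointer could
carry the child type (using enum child_type above, and assuming nodes are
at least 4-byte aligned; helper names are made up):

#include <stdint.h>

#define JA_TYPE_MASK    ((uintptr_t) 3) /* 2 least-significant bits */

static inline void *ja_node_ptr(void *tagged)
{
        return (void *) ((uintptr_t) tagged & ~JA_TYPE_MASK);
}

static inline enum child_type ja_node_type(void *tagged)
{
        return (enum child_type) ((uintptr_t) tagged & JA_TYPE_MASK);
}

static inline void *ja_node_tag(void *node, enum child_type type)
{
        return (void *) ((uintptr_t) node | (uintptr_t) type);
}
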
* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to manage
update concurrency. It is certainly not optimal in terms of concurrency
management within a node, but it has the advantage of being simple to
implement and understand.

We need to keep a count of the number of child nodes (for each node), to
keep track of when the node type thresholds are reached. It is important
to apply hysteresis so we don't switch between node types too often when
the same child keeps being added and removed, as illustrated below.

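Illustrative values only (not chosen thresholds), showing the idea of
keeping the grow and shrink boundaries apart:

/*
 * 64-bit example: grow LINEAR -> BITMAP when the child count reaches 15,
 * but only shrink BITMAP -> LINEAR when it drops back to 11, so that
 * repeated add/removal of a single child around the boundary does not
 * reallocate the node every time.
 */
#define JA_LINEAR_TO_BITMAP_GROW        15
#define JA_BITMAP_TO_LINEAR_SHRINK      11
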
We acquire locks from child to parent, nested. We take all the locks
required to perform a given update in the tree (but no more), to keep it
consistent with respect to the number of children per node.

If the check for the node being gc'd (always done under the node lock)
fails, we simply need to release the lock and look up the node again, as
sketched below.

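A sketch of this lock-and-revalidate step, assuming a per-node mutex, a
gc flag, and a hypothetical per-level RCU lookup helper (all names are
made up for illustration):

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <urcu.h>               /* struct rcu_head */

struct ja_root;                 /* opaque here; defined by the tree code */

/* Hypothetical per-node header carrying the update lock and gc flag. */
struct ja_inode {
        struct rcu_head rcu_head;       /* deferred free via call_rcu */
        pthread_mutex_t lock;
        bool gc;                        /* set once this node has been replaced */
        /* ... type-specific payload follows ... */
};

/* Hypothetical: RCU lookup of the node at 'level' on the path to 'key'. */
struct ja_inode *ja_rcu_lookup_level(struct ja_root *root, uint64_t key,
                                     unsigned int level);

/*
 * Lock a node found by RCU lookup, then re-check that it has not been
 * gc'd (replaced) since the lookup; retry the lookup if it has.
 */
static struct ja_inode *ja_lock_node(struct ja_root *root, uint64_t key,
                                     unsigned int level)
{
        struct ja_inode *node;

        for (;;) {
                node = ja_rcu_lookup_level(root, key, level);
                if (!node)
                        return NULL;
                pthread_mutex_lock(&node->lock);
                if (!node->gc)
                        return node;    /* still valid; caller keeps the lock */
                pthread_mutex_unlock(&node->lock);
                /* Node was replaced concurrently: look it up again. */
        }
}
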
- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree. If a level is not populated, fail.
Continue until we reach the leaf node.

rcu_read_unlock()

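Putting together the per-type lookups and the pointer-tag decoding
sketched earlier, a leaf lookup could look like the following
(illustrative only; reuses ja_level_index(), ja_node_type(), ja_node_ptr()
and the per-type lookup helpers from the sketches above):

#include <stdint.h>
#include <urcu.h>               /* rcu_read_lock(), rcu_dereference() */

/* Hypothetical root: a type-tagged pointer to the topmost node. */
struct ja_root {
        void *node;
};

static void *ja_leaf_lookup(struct ja_root *root, uint64_t key)
{
        void *tagged;
        unsigned int level;

        rcu_read_lock();
        tagged = rcu_dereference(root->node);
        for (level = 0; level < JA_LEVELS && tagged; level++) {
                uint8_t n = ja_level_index(key, level);

                switch (ja_node_type(tagged)) {
                case LINEAR:
                        tagged = ja_linear_lookup(ja_node_ptr(tagged), n);
                        break;
                case BITMAP:
                        tagged = ja_bitmap_lookup(ja_node_ptr(tagged), n);
                        break;
                case PIGEON:
                        tagged = ja_pigeon_lookup(ja_node_ptr(tagged), n);
                        break;
                case LEAF:
                        goto out;       /* not expected before the last level here */
                }
        }
out:
        rcu_read_unlock();
        /* NULL: not populated. A real API would let the caller hold the
         * read-side lock while it uses the returned leaf. */
        return tagged ? ja_node_ptr(tagged) : NULL;
}
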
- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup the insert position. Find the location in the tree where nodes
are missing for this insertion. If the leaf is already present, the insert
fails, releasing the rcu read lock. The insert location consists of a
parent node to which we want to attach a new node.

B) Lock

RCU-lookup the parent node. Take the parent lock. If the parent needs to
be reallocated to make room for this insertion, RCU-lookup the
parent-parent node and take the parent-parent lock. For each lock taken,
check if the node is being gc'd. If gc'd, release the lock, re-RCU-lookup
this node, and retry.

C) Create

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting.

D) Populate
- If parent node reallocation is needed:
  Reallocate the parent node, adding the new branch to it, and
  increment its node count.
  Set the gc flag in the old nodes.
  call_rcu free for all old nodes.
  Publish the new parent node with rcu_assign_pointer.
- Else:
  Increment the parent node count.
  Use rcu_assign_pointer to populate this new branch into the parent
  node.

E) Locks

Release the parent and (if taken) parent-parent locks.
rcu_read_unlock()

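The reallocation branch of step D is the classic RCU replace-then-reclaim
pattern. A sketch using the liburcu rcu_assign_pointer()/call_rcu() API
and the hypothetical ja_inode header from the locking sketch above:

#include <stdbool.h>
#include <stdlib.h>
#include <urcu.h>       /* rcu_assign_pointer(), call_rcu() */

static void ja_free_node_rcu(struct rcu_head *head)
{
        /* rcu_head is the first field of struct ja_inode in this sketch. */
        free(head);
}

/*
 * Publish step (D), reallocation case: 'slot' is the child slot in the
 * parent-parent node that currently points to 'old_parent'; 'new_tagged'
 * is the reallocated parent (new branch attached, child count already
 * incremented), tagged with its node type.
 */
static void ja_publish_realloc(void **slot, struct ja_inode *old_parent,
                               void *new_tagged)
{
        old_parent->gc = true;                  /* mark the old node as gc'd */
        rcu_assign_pointer(*slot, new_tagged);  /* readers switch to the new node */
        call_rcu(&old_parent->rcu_head, ja_free_node_rcu);
}
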
- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup the leaf to remove. If the leaf is missing, fail and release
the rcu read lock.

B) Lock

RCU-lookup the parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup the parent-parent and take the parent-parent lock.
Do so recursively until no node reallocation is needed, or until the root
is reached.

For each lock taken, check if the node is being gc'd. If gc'd, release
the lock, re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of a branch) covered by the taken locks necessarily
has a simple node removal or node update to perform on its topmost node.

If that operation is a node removal, then the entire branch under the
removed node simply disappears. No node allocation is needed.

Else, if the operation is a child node reallocation, the child node
itself necessarily performs a node removal, so _its_ entire child branch
disappears. Reallocate this child node without the removed branch
(remember to decrement its nr-children count).

D) Populate

No-reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its nr-children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set the old nodes' gc flag.
call_rcu free for all old nodes.

E) Locks

Release all locks.
rcu_read_unlock()

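The no-reallocation branch of step D is a single pointer store (sketch;
the reallocation branch follows the same publish pattern as in the
insertion sketch above):

#include <stdint.h>
#include <urcu.h>       /* rcu_assign_pointer() */

/*
 * Step D, no-reallocation case: clear the child slot in the topmost
 * locked node and decrement its child count.
 */
static void ja_remove_child(void **slot, uint8_t *nr_children)
{
        rcu_assign_pointer(*slot, NULL);
        (*nr_children)--;
}
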
For the various types of nodes:

- sequential search (type A)
  - RCU replacement: mutex
  - Entry update: mutex

- bitmap followed by pointer array (type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon-hole array (type C)
  - RCU replacement: mutex
  - Entry update: mutex