RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using a Judy Array (compressed nodes) for an RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than in, e.g.,
  non-RCU red-black trees.

Choice: using 256-entry intermediate nodes (the index can be represented
on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit.
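
For illustration, a minimal sketch of how the 8-bit index for a given
level could be extracted from a key. The helper name and the byte
ordering (most significant byte first) are hypothetical, not taken from
the implementation:

#include <stdint.h>

/* Hypothetical helper: extract the 8-bit index used at a given tree
 * level (level 0 is the top of the tree; tree_depth is 4 on 32-bit
 * and 8 on 64-bit). */
static inline uint8_t ja_level_index(uint64_t key, unsigned int level,
                unsigned int tree_depth)
{
        return (uint8_t) (key >> ((tree_depth - 1 - level) * 8));
}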


* Node types (from least dense node to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal just needs to update the value and pointer arrays,
  single-entry (non-RCU...). For RCU, we might need to update the entire
  node anyway.
- Requires a sequential search through the whole value array for the
  lookup failure test.

Filled at 3 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
 - lookup fail & success: 1 cache line.

Filled at 6 entries max (32-bit), 7 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4 bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
 - lookup fail & success: 1 cache line.

Filled at 12 entries max (32-bit), 14 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4 bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max (32-bit), 28 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in the values array fits in a 32-byte cache line.
 - lookup failure: 1 cache line.
 - lookup success: 2 cache lines.

The two below are listed for completeness' sake, but because they
require two 32-byte cache lines for lookup, they are deemed
inappropriate.

Filled at 51 entries max (32-bit), 56 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4 bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max (32-bit), 113 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4 bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes


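As an illustration of the layout described above, here is a minimal
sketch of a 64-byte (64-bit, up to 7 children) Type A node and its
sequential lookup. Names are hypothetical; the lookup assumes liburcu's
rcu_dereference():

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

/* Hypothetical Type A node, 64-bit layout: 1 + 7 bytes, then 7 pointers
 * = 64 bytes. Larger size classes only change the array lengths. */
struct ja_linear_node {
        uint8_t nr_children;
        uint8_t values[7];      /* entries [0 .. nr_children-1] are valid */
        void *children[7];      /* children[i] is associated with values[i] */
};

/* Sequential search: a failed lookup only touches the values array. */
static void *ja_linear_lookup(struct ja_linear_node *node, uint8_t v)
{
        unsigned int i;

        for (i = 0; i < node->nr_children; i++) {
                if (node->values[i] == v)
                        return rcu_dereference(node->children[i]);
        }
        return NULL;
}
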
- Type B: pools of values and pointers arrays

Pools of values and pointers arrays. Each pool values array is 32 bytes
in size (so it fits in an L1 cache line). Each pool begins with an 8-bit
integer, which is the number of children in this pool, followed by an
array of 8-bit values, padding, and an array of pointers. Values and
pointer arrays are associated as in Type A.

The entries of a node are associated to their respective pool based
on their index position.

+ Allows lookup failure to touch only one 32-byte cache line (1 cache line).
  Lookup success: 2 cache lines.

+ Allows in-place updates without reallocation, except when a pool is
  full. (This was not possible with bitmap-based nodes.)
- If one pool exhausts its space, we need to increase the node size.
  Therefore, for very dense populations, we will end up using the
  pigeon-hole node type sooner, thus consuming more space.

Pool configuration:

Per pool, filled at 25 entries (32-bit), 28 entries (64-bit)
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

Total up to 50 entries (32-bit), 56 entries (64-bit)
2 pools: 32-bit = 256 bytes
2 pools: 64-bit = 512 bytes

Total up to 100 entries (32-bit), 112 entries (64-bit)
4 pools: 32-bit = 512 bytes
4 pools: 64-bit = 1024 bytes
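
A hypothetical sketch of the 2-pool, 64-bit configuration described
above (names are illustrative; the 4-pool variant simply uses four such
pools):

#include <stdint.h>

/* Hypothetical pool: the 28-entry 64-bit linear layout,
 * 1 + 28 bytes + 3 bytes pad + 28 pointers = 256 bytes. */
struct ja_pool {
        uint8_t nr_children;
        uint8_t values[28];
        uint8_t pad[3];
        void *children[28];
};

/* Hypothetical 2-pool Type B node: 512 bytes, up to 56 children.
 * Each entry is assigned to a pool based on its index position. */
struct ja_pool_node {
        struct ja_pool pools[2];
};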


* Choice of pool configuration distribution:

We have pools of either 2 or 4 linear arrays. Their total size is
between 256 bytes (32-bit, 2 arrays) and 1024 bytes (64-bit, 4 arrays).

Alignment on 256 bytes means that we can spare the 8 least significant
bits of the pointers. Given that the type selection already uses 3 bits,
we have 7 bits left.

Alignment on 512 bytes -> 8 bits left.

We can therefore encode which bit, or which two bits, are used as the
distribution selection. We can use this technique to re-equilibrate
pools if they become unbalanced (e.g. all children are within one of the
two linear arrays).
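
A sketch of how the selected distribution bit (or pair of bits) could
map an 8-bit value to its pool; the bit positions would be recovered
from the spare low-order pointer bits discussed above (helper names are
hypothetical):

#include <stdint.h>

/* 2-pool node: 'bitsel' is the value bit chosen as distribution selector. */
static inline unsigned int ja_pool_index_2(uint8_t value, unsigned int bitsel)
{
        return (value >> bitsel) & 1;
}

/* 4-pool node: two selector bits are combined into a pool index. */
static inline unsigned int ja_pool_index_4(uint8_t value,
                unsigned int bitsel_hi, unsigned int bitsel_lo)
{
        return (((value >> bitsel_hi) & 1) << 1) | ((value >> bitsel_lo) & 1);
}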

Finding the exact sub-pool usage maximum for any given distribution is
assumed to be NP-complete (not proven).

Unbalance ratios were taken into account by testing programmatically:
randomly taking N entries out of 256, calculating the distribution for
each bit (number of nodes for which the bit is one/zero), computing the
difference in number of nodes for each bit, and choosing the minimum
difference -- over millions of runs (a simplified sketch of such a test
follows the note below).

tot entries     unbalance     largest linear array (stat. approx.)
---------------------------------------------------------------------
48 entries:     2 (98%)       24+1=25 (target ~50/2=25)
54 entries:     2 (97%)       27+1=28 (target ~56/2=28)

Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback if the
subclass does not fit, but it does not need to be efficient.
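
A simplified sketch of the kind of programmatic test mentioned above:
repeatedly draw N distinct values out of 256, and for each draw record
the unbalance of the best single distribution bit. This standalone
program is illustrative only, not the original test harness:

#include <stdio.h>
#include <stdlib.h>

/* For one random draw of 'n' distinct 8-bit values, return the unbalance
 * |ones - zeroes| obtained with the best single distribution bit. */
static unsigned int best_bit_unbalance(unsigned int n)
{
        unsigned char present[256] = { 0 };
        unsigned int i, bit, best = 256;

        for (i = 0; i < n; ) {
                unsigned int v = rand() & 0xFF;

                if (!present[v]) {
                        present[v] = 1;
                        i++;
                }
        }
        for (bit = 0; bit < 8; bit++) {
                unsigned int ones = 0, unbalance;

                for (i = 0; i < 256; i++) {
                        if (present[i] && ((i >> bit) & 1))
                                ones++;
                }
                unbalance = abs((int) (2 * ones) - (int) n);
                if (unbalance < best)
                        best = unbalance;
        }
        return best;
}

int main(void)
{
        unsigned int run, worst = 0;

        for (run = 0; run < 1000000; run++) {
                unsigned int u = best_bit_unbalance(48);

                if (u > worst)
                        worst = u;
        }
        printf("worst unbalance seen for 48 entries: %u\n", worst);
        return 0;
}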


For pools of size 4, we need to approximate the maximum unbalance we can
get for a choice of distributions grouped by pairs of bits.

tot entries     unbalance     largest linear array (stat. approx.)
---------------------------------------------------------------------
92 entries:     8 (99%)       23+2=25 (target: ~100/4=25)
104 entries:    8 (99%)       26+2=28 (target: ~112/4=28)


Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback if the
subclass does not fit, but it does not need to be efficient.


* Population "does not fit" and distribution fallback

When adding a child to a distribution node, if the child does not fit,
we recalculate the best distribution. If it does not fit in that
distribution either, we need to expand the node type.

When removing a child, if the node child count is brought down to the
number of entries expected to statistically fit in the lower-order node,
we try to shrink. However, if we notice that the distribution does not
actually fit in that shrunk node, we abort the shrink operation. If the
shrink fails, we keep a counter of insertion/removal operations on the
node before we allow the shrink to be attempted again.
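
A hypothetical sketch of that fallback bookkeeping: after a failed
shrink, wait for a number of insert/remove operations before trying
again. Field and constant names are made up for illustration:

/* Hypothetical per-node state tracking failed shrink attempts. */
#define JA_SHRINK_RETRY_PERIOD  32      /* updates to wait after a failed shrink */

struct ja_shrink_state {
        unsigned int updates_since_failure;
        int shrink_failed;              /* set when a shrink attempt did not fit */
};

/* Called under the node lock on each insertion/removal; returns nonzero
 * when a shrink may be attempted again. */
static int ja_may_attempt_shrink(struct ja_shrink_state *s)
{
        if (!s->shrink_failed)
                return 1;
        if (++s->updates_since_failure >= JA_SHRINK_RETRY_PERIOD) {
                s->shrink_failed = 0;
                s->updates_since_failure = 0;
                return 1;
        }
        return 0;
}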


- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121 entries+, 64-bit: 125 entries+)
Array of children node pointers. Pointers are NULL if there is no child
at the index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes
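
The pigeon-hole node is a plain direct-indexed array; a minimal sketch
(64-bit layout, hypothetical names, liburcu's rcu_dereference() assumed):

#include <stdint.h>
#include <urcu.h>

/* Hypothetical Type C node: one slot per possible 8-bit index value. */
struct ja_pigeon_node {
        void *children[256];    /* NULL when there is no child at that index */
};

static inline void *ja_pigeon_lookup(struct ja_pigeon_node *node, uint8_t v)
{
        return rcu_dereference(node->children[v]);
}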


* Analysis of the thresholds:

Analysis of the number of cache lines touched for each node, per node
type, depending on the number of children per node, as we increment the
number of children from 0 to 256. Through this, we choose the number of
children thresholds at which it is worthwhile to use a different node
type.

Per node:

- ALWAYS 1 cache line hit for lookup failure (all cases)

32-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache line hit for lookup success
- 32 bytes storage

up to 6 children

- 2 cache line hits for lookup success
- 64 bytes storage

up to 12 children

- 128 bytes storage

up to 25 children

- Type B: pool

- 256 bytes storage

up to 50 children

- 512 bytes storage
up to 100 children

- Type C: pigeon-hole array
- 1 cache line hit for lookup success
- 1024 bytes storage

up to 256 children


64-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache line hit for lookup success
- 32 bytes storage

up to 3 children

- 2 cache line hits for lookup success
- 64 bytes storage

up to 7 children

- 128 bytes storage

up to 14 children

- 256 bytes storage

up to 28 children

- Type B: pool

- 512 bytes storage
up to 56 children

- 1024 bytes storage
up to 112 children

- Type C: pigeon-hole array
- 1 cache line hit for lookup success
- 2048 bytes storage

up to 256 children


* Analysis of node type encoding and node pointers:

Lookups are _always_ from the top of the tree going down. This
facilitates RCU replacement, as we only keep track of pointers going
downward.

The type of a node is encoded in the parent's pointer. We need to
reserve the 2 least-significant bits.

Types of children:

enum child_type {
        RCU_JA_LINEAR = 0,      /* Type A */
                        /* 32-bit: 1 to 25 children, 8 to 128 bytes */
                        /* 64-bit: 1 to 28 children, 16 to 256 bytes */
        RCU_JA_POOL = 1,        /* Type B */
                        /* 32-bit: 26 to 100 children, 256 to 512 bytes */
                        /* 64-bit: 29 to 112 children, 512 to 1024 bytes */
        RCU_JA_PIGEON = 2,      /* Type C */
                        /* 32-bit: 101 to 256 children, 1024 bytes */
                        /* 64-bit: 113 to 256 children, 2048 bytes */
        /* Leaf nodes are implicit from their height in the tree */
};

If the entire pointer is NULL, the child is empty.
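
A sketch of how the child type could be packed into the reserved
low-order bits of the parent's child pointer, given the enum child_type
above and the node alignments discussed earlier (helper names are
hypothetical):

#include <stdint.h>

#define JA_TYPE_MASK    0x3UL   /* 2 least-significant bits reserved for the type */

static inline enum child_type ja_node_type(void *tagged)
{
        return (enum child_type) ((uintptr_t) tagged & JA_TYPE_MASK);
}

static inline void *ja_node_ptr(void *tagged)
{
        return (void *) ((uintptr_t) tagged & ~JA_TYPE_MASK);
}

static inline void *ja_node_tag(void *node, enum child_type type)
{
        return (void *) ((uintptr_t) node | (uintptr_t) type);
}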


* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to manage
update concurrency. It's certainly not optimal in terms of concurrency
management within a node, but it has the advantage of being simple to
implement and understand.

We need to keep a count of the number of children nodes (for each node),
to keep track of when the node type thresholds are reached. It is
important to add hysteresis so we don't change between node types too
often when looping on add/removal of the same node.

We acquire locks from child to parent, nested. We take all locks
required to perform a given update in the tree (but no more) to keep it
consistent with respect to the number of children per node.

If the check for the node being gc'd (always done under the node lock)
fails, we simply need to release the lock and look up the node again.


- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree. If a level is not populated, fail.
Continue until we reach the leaf node.

rcu_read_unlock()
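
A simplified sketch of this read-side descent, assuming liburcu and a
hypothetical ja_child_lookup() helper that dispatches on the node type
(A, B, or C) and returns the child pointer for the given 8-bit index,
or NULL:

#include <stdint.h>
#include <urcu.h>

void *ja_child_lookup(void *tagged_node, uint8_t v);    /* hypothetical per-type dispatch */

void *ja_leaf_lookup(void **root, uint64_t key, unsigned int tree_depth)
{
        void *node;
        unsigned int level;

        rcu_read_lock();
        node = rcu_dereference(*root);
        for (level = 0; node != NULL && level < tree_depth; level++) {
                uint8_t v = (uint8_t) (key >> ((tree_depth - 1 - level) * 8));

                node = ja_child_lookup(node, v);        /* NULL: level not populated, fail */
        }
        rcu_read_unlock();
        /* A real API would keep the read-side lock held while the caller
         * uses the returned leaf. */
        return node;
}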


- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup the insert position. Find the location in the tree where
nodes are missing for this insertion. If the leaf is already present,
the insert fails, releasing the RCU read lock. The insert location
consists of a parent node to which we want to attach a new node.

B) Lock

RCU-lookup the parent node. Take the parent lock. If the parent needs to
be reallocated to make room for this insertion, RCU-lookup the
parent-parent node and take the parent-parent lock. For each lock taken,
check if the node is being gc'd. If gc'd, release the lock,
re-RCU-lookup this node, and retry.
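
A sketch of this lock-and-revalidate step, assuming each node carries a
mutex and a gc flag in a common header (field names hypothetical):

#include <pthread.h>

/* Hypothetical header shared by all node types. */
struct ja_node_head {
        pthread_mutex_t lock;
        int gc_flag;            /* set once the node has been replaced or removed */
};

/* Lock a node found by RCU lookup. Returns -1 if the node was gc'd in
 * the meantime; the caller must then re-RCU-lookup it and retry. */
static int ja_node_lock(struct ja_node_head *node)
{
        pthread_mutex_lock(&node->lock);
        if (node->gc_flag) {
                pthread_mutex_unlock(&node->lock);
                return -1;
        }
        return 0;
}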

C) Create

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting.

D) Populate:
  - If parent node reallocation is needed:
    Reallocate the parent node, adding the new branch to it, and
    increment its node count.
    Set the gc flag in the old nodes.
    call_rcu free for all old nodes.
    Populate the new parent node with rcu_assign_pointer (a sketch of
    this publish step follows this list).
  - Else:
    Increment the parent node count.
    Use rcu_assign_pointer to populate this new branch into the parent
    node.
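
A sketch of the reallocation publish step from D) above, assuming
liburcu's rcu_assign_pointer()/call_rcu() and a hypothetical common node
header embedding the rcu_head and gc flag. The new node is published
before the old one is handed to call_rcu, so readers never see freed
memory:

#include <stdlib.h>
#include <urcu.h>               /* rcu_assign_pointer(), call_rcu() */
#include <urcu/compiler.h>      /* caa_container_of() */

/* Hypothetical common node header. */
struct ja_node {
        struct rcu_head rcu_head;
        int gc_flag;
        /* type-specific layout follows */
};

static void ja_node_free_rcu(struct rcu_head *head)
{
        free(caa_container_of(head, struct ja_node, rcu_head));
}

/* Replace old_node by new_node in the parent slot. Caller holds the
 * locks taken in step B). */
static void ja_node_replace(struct ja_node **slot, struct ja_node *old_node,
                struct ja_node *new_node)
{
        rcu_assign_pointer(*slot, new_node);    /* readers now reach the new node */
        old_node->gc_flag = 1;                  /* make concurrent lockers retry */
        call_rcu(&old_node->rcu_head, ja_node_free_rcu);        /* free after grace period */
}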

E) Locks

Release the parent and (if taken) parent-parent locks.
rcu_read_unlock()


- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup the leaf to remove. If the leaf is missing, fail and release
the RCU read lock.

B) Lock

RCU-lookup the parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup the parent-parent and take the parent-parent lock.
Do so recursively until no node reallocation is needed, or until the
root is reached.

For each lock taken, check if the node is being gc'd. If gc'd, release
the lock, re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of a branch) consisting of the taken locks
necessarily has a simple node removal or update to perform on its top
node.

If the operation is a node removal, then, necessarily, the entire branch
under the node removal operation will simply disappear. No node
allocation is needed.

Else, if the operation is a child node reallocation, the child node will
necessarily do a node removal. So _its_ entire child branch will
disappear. So reallocate this child node without the removed branch
(remember to decrement its nr children count).

D) Populate

No-reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its nr children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set the old nodes' gc flag.
call_rcu free for all old nodes.

E) Locks

Release all locks.
rcu_read_unlock()


For the various types of nodes:

- sequential search (type A)
  - RCU replacement: mutex
  - Entry update: mutex

- pools of value and pointer arrays (type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon-hole array (type C)
  - RCU replacement: mutex
  - Entry update: mutex