RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using a Judy Array (compressed nodes) for an RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than, e.g., non-RCU
  red-black trees.

Choice: use 256-entry intermediate nodes (an index can be represented
on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit.


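With 256-entry intermediate nodes, each tree level consumes one byte of
the key. As a sketch (the helper name and level ordering are
assumptions for illustration, not the rcuja API), the index used at a
given level of a 64-bit tree can be extracted like this:

```c
#include <stdint.h>

/* Hypothetical helper: extract the 8-bit index used at a given tree
 * level for a 64-bit key; level 0 is the topmost of the 8 levels. */
static inline uint8_t ja_level_index(uint64_t key, unsigned int level)
{
	return (uint8_t)(key >> ((7 - level) * 8));
}
```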
* Node types (from least dense to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal just needs to update the value and pointer arrays,
  single-entry (non-RCU...). For RCU, we might need to update the
  entire node anyway.
- Requires a sequential search through the whole value array to detect
  lookup failure.

Filled at 3 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit
in a 32-byte cache line.
- lookup fail & success: 1 cache line.

Filled at 6 entries max (32-bit), 7 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4 = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit
in a 32-byte cache line.
- lookup fail & success: 1 cache line.

Filled at 12 entries max (32-bit), 14 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4 = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max (32-bit), 28 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential
lookup in the values array fits in a 32-byte cache line.
- lookup failure: 1 cache line.
- lookup success: 2 cache lines.

The two sizes below are listed for completeness' sake, but because
they require two 32-byte cache lines for lookup, they are deemed
inappropriate.

Filled at 51 entries max (32-bit), 56 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4 = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max (32-bit), 113 entries max (64-bit)
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4 = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes


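The Type A layout described above can be sketched as follows. This is
a minimal illustration, not the actual rcuja structures: the names,
the fixed maximum, and letting the compiler insert the padding are all
assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define JA_LINEAR_MAX 28	/* 64-bit: largest Type A node */

struct ja_linear_node {
	uint8_t nr_children;		/* 8-bit child count */
	uint8_t values[JA_LINEAR_MAX];	/* packed 8-bit values */
	/* compiler pads here so the pointers are aligned */
	void *pointers[JA_LINEAR_MAX];	/* associated child pointers */
};

/* Sequential search: scan the packed value array, then fetch the
 * pointer at the matching position. Lookup failure only touches the
 * value array, which is why it stays within one cache line for the
 * small node sizes. */
static void *ja_linear_lookup(const struct ja_linear_node *node, uint8_t v)
{
	unsigned int i;

	for (i = 0; i < node->nr_children; i++) {
		if (node->values[i] == v)
			return node->pointers[i];
	}
	return NULL;
}
```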
- Type B: pools of value and pointer arrays

Pools of value and pointer arrays. Each pool's values array is 32
bytes in size (so it fits in an L1 cache line). Each pool begins with
an 8-bit integer, which is the number of children in this pool,
followed by an array of 8-bit values, padding, and an array of
pointers. Value and pointer arrays are associated as in Type A.

The entries of a node are associated with their respective pool based
on their index position.

+ Allows lookup failure to touch only one 32-byte cache line.
  Lookup success: 2 cache lines.
+ Allows in-place updates without reallocation, except when a pool is
  full. (This was not possible with bitmap-based nodes.)
- If one pool exhausts its space, we need to increase the node size.
  Therefore, for very dense populations, we will end up using the
  pigeon-hole node type sooner, thus consuming more space.

Pool configuration:

Per pool, filled at 25 entries (32-bit), 28 entries (64-bit)
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

Total up to 50 entries (32-bit), 56 entries (64-bit)
2 pools: 32-bit = 256 bytes
2 pools: 64-bit = 512 bytes

Total up to 100 entries (32-bit), 112 entries (64-bit)
4 pools: 32-bit = 512 bytes
4 pools: 64-bit = 1024 bytes


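Since entries map to pools by index position, with 2 or 4 pools a
lookup can pick its pool from the high-order bits of the 8-bit index
and then only read that one pool. This is a sketch under that
assumption (the helper name and the exact bit split are not from the
source):

```c
#include <stdint.h>

/* Hypothetical pool selector: split the 8-bit index space evenly
 * across nr_pools pools (nr_pools is a power of two, 2 or 4 here), so
 * each lookup touches a single pool's values array. */
static inline unsigned int ja_pool_nr(uint8_t v, unsigned int nr_pools)
{
	return v / (256 / nr_pools);
}
```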
- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121+ entries, 64-bit: 125+ entries)
Array of child node pointers. A pointer is NULL if there is no child
at that index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes


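A Type C node is a direct-mapped array, so a lookup is a single
indexed load. A minimal sketch (names are illustrative, not the rcuja
structures):

```c
#include <stddef.h>
#include <stdint.h>

/* Pigeon-hole node: one pointer slot per possible 8-bit index,
 * NULL meaning "no child". 64-bit: 8 * 256 = 2048 bytes. */
struct ja_pigeon_node {
	void *slots[256];
};

/* Both lookup failure and success cost a single cache-line hit:
 * index directly into the slot array. */
static inline void *ja_pigeon_lookup(const struct ja_pigeon_node *node,
		uint8_t v)
{
	return node->slots[v];
}
```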
* Analysis of the thresholds:

Analysis of the number of cache lines touched for each node, per node
type, depending on the number of children per node, as we increment
the number of children from 0 to 256. Through this, we choose the
number-of-children thresholds at which it is worthwhile to switch to a
different node type.

Per node:

- ALWAYS 1 cache-line hit for lookup failure (all cases)

32-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache-line hit for lookup success
- 32 bytes storage

up to 6 children

- 2 cache-line hits for lookup success
- 64 bytes storage

up to 12 children

- 128 bytes storage

up to 25 children

- Type B: pool

- 256 bytes storage

up to 50 children

- 512 bytes storage
up to 100 children

- Type C: pigeon-hole array
- 1 cache-line hit for lookup success
- 1024 bytes storage

up to 256 children


64-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache-line hit for lookup success
- 32 bytes storage

up to 3 children

- 2 cache-line hits for lookup success
- 64 bytes storage

up to 7 children

- 128 bytes storage

up to 14 children

- 256 bytes storage

up to 28 children

- Type B: pool

- 512 bytes storage
up to 56 children

- 1024 bytes storage
up to 112 children

- Type C: pigeon-hole array
- 1 cache-line hit for lookup success
- 2048 bytes storage

up to 256 children


* Analysis of node type encoding and node pointers:

Lookups are _always_ performed from the top of the tree going down.
This facilitates RCU replacement, as we only keep track of pointers
going downward.

The type of a node is encoded in the parent's pointer. We need to
reserve the 2 least-significant bits.

Types of children:

enum child_type {
	RCU_JA_LINEAR = 0,	/* Type A */
			/* 32-bit: 1 to 25 children, 8 to 128 bytes */
			/* 64-bit: 1 to 28 children, 16 to 256 bytes */
	RCU_JA_POOL = 1,	/* Type B */
			/* 32-bit: 26 to 100 children, 256 to 512 bytes */
			/* 64-bit: 29 to 112 children, 512 to 1024 bytes */
	RCU_JA_PIGEON = 2,	/* Type C */
			/* 32-bit: 101 to 256 children, 1024 bytes */
			/* 64-bit: 113 to 256 children, 2048 bytes */
	/* Leaf nodes are implicit from their height in the tree */
};

If the entire pointer is NULL, the child is empty.


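Reserving the 2 least-significant bits implies node allocations must
be at least 4-byte aligned. A sketch of the tagging helpers (names are
illustrative, not the rcuja API):

```c
#include <stdint.h>

enum child_type {
	RCU_JA_LINEAR = 0,	/* Type A */
	RCU_JA_POOL = 1,	/* Type B */
	RCU_JA_PIGEON = 2,	/* Type C */
};

/* Strip the type tag to recover the node address. */
static inline void *ja_node_ptr(void *tagged)
{
	return (void *)((uintptr_t)tagged & ~(uintptr_t)3);
}

/* Read the node type from the 2 least-significant bits. */
static inline enum child_type ja_node_type(void *tagged)
{
	return (enum child_type)((uintptr_t)tagged & 3);
}

/* Combine a (4-byte-aligned) node address with its type. */
static inline void *ja_node_tag(void *node, enum child_type type)
{
	return (void *)((uintptr_t)node | (uintptr_t)type);
}
```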
* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to
manage update concurrency. It is certainly not optimal in terms of
concurrency management within a node, but it has the advantage of
being simple to implement and understand.

We need to keep a count of the number of children per node, to keep
track of when the node type thresholds are reached. It is important to
add hysteresis so we don't switch between node types too often when a
loop repeatedly adds and removes the same node.

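The hysteresis idea can be sketched as follows: grow to the next node
type above one threshold, but only shrink back below a strictly lower
one, so an add/remove loop at the boundary does not reallocate on
every operation. The threshold values and names here are assumptions
for illustration, not the tuned rcuja constants.

```c
/* Hypothetical thresholds for the 64-bit linear <-> pool transition:
 * grow above 28 children, but only shrink back below 25, leaving a
 * hysteresis band where neither conversion triggers. */
#define JA_POOL_GROW_THRESHOLD		28
#define JA_POOL_SHRINK_THRESHOLD	25

static inline int ja_should_grow(unsigned int nr_children)
{
	return nr_children > JA_POOL_GROW_THRESHOLD;
}

static inline int ja_should_shrink(unsigned int nr_children)
{
	return nr_children < JA_POOL_SHRINK_THRESHOLD;
}
```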
We acquire locks from child to parent, nested. We take all the locks
required to perform a given update in the tree (but no more) to keep
it consistent with respect to the number of children per node.

If the check for a node being gc'd (always done with the node lock
held) detects the node is being gc'd, we simply release the lock and
look up the node again.

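The lock-then-check-gc pattern used below in insertion and removal can
be sketched like this (names are illustrative; in the real tree the
caller would redo the RCU lookup before retrying):

```c
#include <pthread.h>

/* Minimal stand-in for a lockable tree node with a gc flag that is
 * set once the node has been scheduled for reclamation. */
struct ja_locked_node {
	pthread_mutex_t lock;
	int gc_flag;
};

/* Returns 0 with the lock held, or -1 if the node is being gc'd and
 * the caller must re-lookup the node and retry. */
static int ja_lock_node(struct ja_locked_node *node)
{
	pthread_mutex_lock(&node->lock);
	if (node->gc_flag) {
		pthread_mutex_unlock(&node->lock);
		return -1;
	}
	return 0;
}
```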
- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree until we reach the leaf node. If a
level is not populated, fail.

rcu_read_unlock()

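The top-down walk can be sketched as follows. In the real tree each
child fetch would go through rcu_dereference() between
rcu_read_lock()/rcu_read_unlock(), and the node type would be decoded
at each level; plain loads and a single pigeon-hole-style node type
are used here so the sketch stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>

struct ja_node {
	struct ja_node *slots[256];	/* one slot per 8-bit index */
};

#define JA_LEVELS_64	8	/* one 8-bit index per level on 64-bit */

static void *ja_lookup(struct ja_node *root, uint64_t key)
{
	struct ja_node *node = root;	/* rcu_dereference() in real code */
	unsigned int level;

	for (level = 0; level < JA_LEVELS_64; level++) {
		uint8_t idx = (uint8_t)(key >> ((JA_LEVELS_64 - 1 - level) * 8));

		if (!node)
			return NULL;	/* level not populated: fail */
		node = node->slots[idx];	/* rcu_dereference() in real code */
	}
	return node;	/* leaf pointer, or NULL if absent */
}
```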
- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup the insert position: find the location in the tree where
nodes are missing for this insertion. If the leaf is already present,
the insert fails, releasing the RCU read lock. The insert location
consists of a parent node to which we want to attach a new node.

B) Lock

RCU-lookup the parent node. Take the parent lock. If the parent needs
to be reallocated to make room for this insertion, RCU-lookup the
parent-parent node and take the parent-parent lock. For each lock
taken, check if the node is being gc'd. If gc'd, release the lock,
re-RCU-lookup this node, and retry.

C) Create

Construct the whole branch, from the new topmost intermediate node
down to the new leaf node we are inserting.

D) Populate

- If parent node reallocation is needed:
  Reallocate the parent node, adding the new branch to it, and
  increment its node count.
  Set the gc flag in the old nodes.
  call_rcu free for all old nodes.
  Publish the new parent node with rcu_assign_pointer.
- Else:
  Increment the parent node count.
  Use rcu_assign_pointer to populate this new branch into the parent
  node.

E) Unlock

Release the parent and (if taken) parent-parent locks.
rcu_read_unlock()

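The reallocation branch of step D can be sketched as follows. The urcu
primitives are shown in comments and replaced with a compiler builtin
stand-in so the sketch compiles without liburcu; all structure and
function names are illustrative, not the rcuja implementation.

```c
#include <stddef.h>

struct ja_fake_node {
	unsigned int nr_children;
	int gc_flag;	/* set on nodes pending reclamation */
	void *child;	/* single child slot, for the sketch */
};

/* Reallocate-and-publish: fill the new parent, mark the old one as
 * gc'd (so concurrent lockers retry), defer its free, then publish
 * the new parent with a release store. */
static void ja_populate_realloc(struct ja_fake_node **parent_slot,
		struct ja_fake_node *old_parent,
		struct ja_fake_node *new_parent,
		void *new_branch)
{
	new_parent->child = new_branch;
	new_parent->nr_children = old_parent->nr_children + 1;

	old_parent->gc_flag = 1;
	/* call_rcu(&old_parent->rcu_head, free_node); in real code */

	/* rcu_assign_pointer(*parent_slot, new_parent); in real code: */
	__atomic_store_n(parent_slot, new_parent, __ATOMIC_RELEASE);
}
```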
- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup the leaf to remove. If the leaf is missing, fail and
release the RCU read lock.

B) Lock

RCU-lookup the parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number
of children, RCU-lookup the parent-parent and take the parent-parent
lock. Do so recursively until no node reallocation is needed, or until
the root is reached.

For each lock taken, check if the node is being gc'd. If gc'd, release
the lock, re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of a branch) covered by the taken locks
necessarily has a simple node removal or update to perform on its top
node.

If the operation is a node removal, then, necessarily, the entire
branch under the removed node simply disappears. No node allocation is
needed.

Else, if the operation is a child node reallocation, the child node
necessarily performs a node removal, so _its_ entire child branch will
disappear. Reallocate this child node without the removed branch
(remember to decrement its children count).

D) Populate

No-reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set the old nodes' gc flag.
call_rcu free for all old nodes.

E) Unlock

Release all locks.
rcu_read_unlock()

For the various types of nodes:

- sequential search (Type A)
- RCU replacement: mutex
- Entry update: mutex

- pools of value and pointer arrays (Type B)
- RCU replacement: mutex
- Entry update: mutex

- pigeon-hole array (Type C)
- RCU replacement: mutex
- Entry update: mutex