RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and an RCU-specific analysis are introduced in this document.

Advantages of using a Judy Array (compressed nodes) for an RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than, e.g., in non-RCU
  red-black trees.

Choice: use 256-entry intermediate nodes (an index can be represented on
8 bits): 4 levels on 32-bit, 8 levels on 64-bit.

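With 256-entry nodes, each level of the tree consumes 8 bits of the key.
As a minimal sketch (assuming 64-bit keys and a most-significant-byte-first
traversal order, which is an illustration choice, not something fixed by
this document):

#include <stdint.h>

#define JA_LEVELS       8       /* 64-bit: 8 levels of 8-bit indices */

/* Return the 8-bit index used at 'level' (0 = topmost) for 'key'. */
static inline uint8_t ja_level_index(uint64_t key, unsigned int level)
{
        return (uint8_t) (key >> ((JA_LEVELS - 1 - level) * 8));
}
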
* Node types (from least dense node to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal just needs to update the value and pointer arrays,
  single-entry (non-RCU...). For RCU, we might need to update the entire
  node anyway.
- Requires a sequential search through the whole value array for the
  lookup-failure test.

Filled at 3 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail & success: 1 cache line.

Filled at 6 entries max 32-bit, 7 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4 bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail & success: 1 cache line.

Filled at 12 entries max 32-bit, 14 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4 bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max 32-bit, 28 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4 bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in the values array fits in a 32-byte cache line.
  - lookup failure: 1 cache line.
  - lookup success: 2 cache lines.

The two configurations below are listed for completeness' sake, but
because they require two 32-byte cache lines for lookup, they are deemed
inappropriate.

Filled at 51 entries max 32-bit, 56 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4 bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max 32-bit, 113 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4 bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes

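For illustration, here is a minimal sketch of a Type A node and its
sequential lookup, using the 7-entry / 64-byte 64-bit configuration above.
All names are made up and the actual layout may differ; rcu_dereference()
is the liburcu read-side primitive:

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

/* Hypothetical Type A (linear) node: count, value array, pointer array. */
struct ja_node_linear {
        uint8_t nr_children;    /* 8 bits indicating number of children */
        uint8_t values[7];      /* 8-bit values of populated children */
        void *children[7];      /* associated child pointers */
};

/* Sequential search: return the child for 8-bit index 'n', NULL if absent. */
static void *ja_linear_lookup(struct ja_node_linear *node, uint8_t n)
{
        unsigned int i;

        for (i = 0; i < node->nr_children; i++) {
                if (node->values[i] == n)
                        return rcu_dereference(node->children[i]);
        }
        return NULL;    /* lookup failure: only the values array was read */
}
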
- Type B: bitmap, followed by pointers array.

bitmask (256 entries -> 256 bits -> 32 bytes) of populated children,
followed by an array of children pointers, in the same order as they
appear in the bitmask.

+ Allows lookup failure to use a 32-byte cache line only. (1 cache line)
+ Allows lookup success to use the 32-byte cache line (bitmap),
  followed by a direct lookup into the pointer array. (2 cache lines)

Filled at 8 entries 32-bit, 12 entries 64-bit
32-bit: 32 + (4*8) -> 64 bytes
64-bit: 32 + (8*12) -> 128 bytes

Filled at 24 entries 32-bit, 28 entries 64-bit
32-bit: 32 + (4*24) -> 128 bytes
64-bit: 32 + (8*28) -> 256 bytes

Filled at 56 entries 32-bit, 60 entries 64-bit
32-bit: 32 + (4*56) -> 256 bytes
64-bit: 32 + (8*60) -> 512 bytes

Filled at 120 entries 32-bit, 124 entries 64-bit
32-bit: 32 + (4*120) -> 512 bytes
64-bit: 32 + (8*124) -> 1024 bytes

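For illustration, a minimal sketch of a Type B node and its lookup: the
bitmap test alone resolves lookup failures, and a population count of the
bits below the index gives the offset into the packed pointer array.
Names are made up, and GCC popcount builtins are assumed:

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

/* Hypothetical Type B node: 256-bit bitmap followed by packed pointers. */
struct ja_node_bitmap {
        uint64_t bitmap[4];     /* 256 bits = 32 bytes */
        void *children[];       /* one pointer per bit set, in bit order */
};

static void *ja_bitmap_lookup(struct ja_node_bitmap *node, uint8_t n)
{
        unsigned int word = n >> 6, bit = n & 63;
        unsigned int i, pos = 0;

        if (!(node->bitmap[word] & (1ULL << bit)))
                return NULL;    /* lookup failure: bitmap cache line only */
        /* Count populated children before index n to find the array offset. */
        for (i = 0; i < word; i++)
                pos += __builtin_popcountll(node->bitmap[i]);
        pos += __builtin_popcountll(node->bitmap[word] & ((1ULL << bit) - 1));
        return rcu_dereference(node->children[pos]);
}
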
- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121+ entries, 64-bit: 125+ entries)
Array of child-node pointers. Pointer is NULL if there is no child at
that index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes

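A Type C lookup is a single array indexation (sketch, made-up names):

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */

/* Hypothetical Type C node: direct-indexed array of 256 child pointers. */
struct ja_node_pigeon {
        void *children[256];    /* NULL when there is no child at that index */
};

static void *ja_pigeon_lookup(struct ja_node_pigeon *node, uint8_t n)
{
        return rcu_dereference(node->children[n]);
}
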
* Analysis of the thresholds:

Analysis of the number of cache lines touched for each node, per node
type, depending on the number of children per node, as we increment the
number of children from 0 to 256. Through this, we choose the
number-of-children thresholds at which it becomes worthwhile to use a
different node type.

Per node:

- ALWAYS 1 cache-line hit for lookup failure (all cases)

32-bit

- Nonexistent

  0 children

- Type A: sequential search in value and pointer arrays
  - 1 cache-line hit for lookup success
  - 32 bytes storage

  up to 6 children

  - 2 cache-line hits for lookup success
  - 64 bytes storage

  up to 12 children

- Type B: bitmap, followed by pointers array.
  - 2 cache-line hits for lookup success
  - 128 bytes storage

  up to 24 children

  - 256 bytes storage
  up to 56 children

  - 512 bytes storage
  up to 120 children

- Type C: pigeon-hole array
  - 1 cache-line hit for lookup success
  - 1024 bytes storage

  up to 256 children


64-bit

- Nonexistent

  0 children

- Type A: sequential search in value and pointer arrays
  - 1 cache-line hit for lookup success
  - 32 bytes storage

  up to 3 children

  - 2 cache-line hits for lookup success
  - 64 bytes storage

  up to 7 children

  - 128 bytes storage

  up to 14 children

- Type B: bitmap, followed by pointers array.
  - 2 cache-line hits for lookup success
  - 256 bytes storage

  up to 28 children

  - 512 bytes storage
  up to 60 children

  - 1024 bytes storage
  up to 124 children

- Type C: pigeon-hole array
  - 1 cache-line hit for lookup success
  - 2048 bytes storage

  up to 256 children


* Analysis of node type encoding and node pointers:

Lookups are _always_ performed from the top of the tree going down. This
facilitates RCU replacement, as we only keep track of pointers going
downward.

The type of a node is encoded in the parent's pointer to it. We need to
reserve the 2 least-significant bits.

Types of children:

enum child_type {
        LINEAR = 0,     /* Type A */
                        /* 32-bit: 1 to 12 children, 32 to 64 bytes */
                        /* 64-bit: 1 to 14 children, 32 to 128 bytes */
        BITMAP = 1,     /* Type B */
                        /* 32-bit: 13 to 120 children, 128 to 512 bytes */
                        /* 64-bit: 15 to 124 children, 256 to 1024 bytes */
        PIGEON = 2,     /* Type C */
                        /* 32-bit: 121 to 256 children, 1024 bytes */
                        /* 64-bit: 125 to 256 children, 2048 bytes */
        LEAF = 3,       /* Leaf node */
};

If the entire pointer is NULL, the child is empty.

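A sketch of how the 2 least-significant bits of a child pointer could
carry the child type (using enum child_type above, and assuming nodes are
at least 4-byte aligned; helper names are made up):

#include <stdint.h>

#define JA_TYPE_MASK    ((uintptr_t) 3) /* 2 least-significant bits */

static inline void *ja_node_ptr(void *tagged)
{
        return (void *) ((uintptr_t) tagged & ~JA_TYPE_MASK);
}

static inline enum child_type ja_node_type(void *tagged)
{
        return (enum child_type) ((uintptr_t) tagged & JA_TYPE_MASK);
}

static inline void *ja_node_tag(void *node, enum child_type type)
{
        return (void *) ((uintptr_t) node | (uintptr_t) type);
}
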
* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to manage
update concurrency. It is certainly not optimal in terms of concurrency
management within a node, but it has the advantage of being simple to
implement and understand.

We need to keep a count of the number of child nodes (for each node), to
keep track of when the node type thresholds are reached. It is important
to apply hysteresis so we don't switch between node types too often when
the same child keeps being added and removed, as illustrated below.

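Illustrative values only (not chosen thresholds), showing the idea of
keeping the grow and shrink boundaries apart:

/*
 * 64-bit example: grow LINEAR -> BITMAP when the child count reaches 15,
 * but only shrink BITMAP -> LINEAR when it drops back to 11, so that
 * repeated add/removal of a single child around the boundary does not
 * reallocate the node every time.
 */
#define JA_LINEAR_TO_BITMAP_GROW        15
#define JA_BITMAP_TO_LINEAR_SHRINK      11
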
We acquire locks from child to parent, nested. We take all the locks
required to perform a given update in the tree (but no more), to keep it
consistent with respect to the number of children per node.

If the check for the node being gc'd (always done under the node lock)
fails, we simply need to release the lock and look up the node again, as
sketched below.

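A sketch of this lock-and-revalidate step, assuming a per-node mutex, a
gc flag, and a hypothetical per-level RCU lookup helper (all names are
made up for illustration):

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <urcu.h>               /* struct rcu_head */

struct ja_root;                 /* opaque here; defined by the tree code */

/* Hypothetical per-node header carrying the update lock and gc flag. */
struct ja_inode {
        struct rcu_head rcu_head;       /* deferred free via call_rcu */
        pthread_mutex_t lock;
        bool gc;                        /* set once this node has been replaced */
        /* ... type-specific payload follows ... */
};

/* Hypothetical: RCU lookup of the node at 'level' on the path to 'key'. */
struct ja_inode *ja_rcu_lookup_level(struct ja_root *root, uint64_t key,
                                     unsigned int level);

/*
 * Lock a node found by RCU lookup, then re-check that it has not been
 * gc'd (replaced) since the lookup; retry the lookup if it has.
 */
static struct ja_inode *ja_lock_node(struct ja_root *root, uint64_t key,
                                     unsigned int level)
{
        struct ja_inode *node;

        for (;;) {
                node = ja_rcu_lookup_level(root, key, level);
                if (!node)
                        return NULL;
                pthread_mutex_lock(&node->lock);
                if (!node->gc)
                        return node;    /* still valid; caller keeps the lock */
                pthread_mutex_unlock(&node->lock);
                /* Node was replaced concurrently: look it up again. */
        }
}
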
- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree. If a level is not populated, fail.
Continue until we reach the leaf node.

rcu_read_unlock()

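Putting together the per-type lookups and the pointer-tag decoding
sketched earlier, a leaf lookup could look like the following
(illustrative only; reuses ja_level_index(), ja_node_type(), ja_node_ptr()
and the per-type lookup helpers from the sketches above):

#include <stdint.h>
#include <urcu.h>               /* rcu_read_lock(), rcu_dereference() */

/* Hypothetical root: a type-tagged pointer to the topmost node. */
struct ja_root {
        void *node;
};

static void *ja_leaf_lookup(struct ja_root *root, uint64_t key)
{
        void *tagged;
        unsigned int level;

        rcu_read_lock();
        tagged = rcu_dereference(root->node);
        for (level = 0; level < JA_LEVELS && tagged; level++) {
                uint8_t n = ja_level_index(key, level);

                switch (ja_node_type(tagged)) {
                case LINEAR:
                        tagged = ja_linear_lookup(ja_node_ptr(tagged), n);
                        break;
                case BITMAP:
                        tagged = ja_bitmap_lookup(ja_node_ptr(tagged), n);
                        break;
                case PIGEON:
                        tagged = ja_pigeon_lookup(ja_node_ptr(tagged), n);
                        break;
                case LEAF:
                        goto out;       /* not expected before the last level here */
                }
        }
out:
        rcu_read_unlock();
        /* NULL: not populated. A real API would let the caller hold the
         * read-side lock while it uses the returned leaf. */
        return tagged ? ja_node_ptr(tagged) : NULL;
}
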
- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup the insert position. Find the location in the tree where nodes
are missing for this insertion. If the leaf is already present, the insert
fails, releasing the rcu read lock. The insert location consists of a
parent node to which we want to attach a new node.

B) Lock

RCU-lookup the parent node. Take the parent lock. If the parent needs to
be reallocated to make room for this insertion, RCU-lookup the
parent-parent node and take the parent-parent lock. For each lock taken,
check if the node is being gc'd. If gc'd, release the lock, re-RCU-lookup
this node, and retry.

C) Create

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting.

D) Populate
- If parent node reallocation is needed:
  Reallocate the parent node, adding the new branch to it, and
  increment its node count.
  Set the gc flag in the old nodes.
  call_rcu free for all old nodes.
  Publish the new parent node with rcu_assign_pointer.
- Else:
  Increment the parent node count.
  Use rcu_assign_pointer to populate this new branch into the parent
  node.

E) Locks

Release the parent and (if taken) parent-parent locks.
rcu_read_unlock()

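The reallocation branch of step D is the classic RCU replace-then-reclaim
pattern. A sketch using the liburcu rcu_assign_pointer()/call_rcu() API
and the hypothetical ja_inode header from the locking sketch above:

#include <stdbool.h>
#include <stdlib.h>
#include <urcu.h>       /* rcu_assign_pointer(), call_rcu() */

static void ja_free_node_rcu(struct rcu_head *head)
{
        /* rcu_head is the first field of struct ja_inode in this sketch. */
        free(head);
}

/*
 * Publish step (D), reallocation case: 'slot' is the child slot in the
 * parent-parent node that currently points to 'old_parent'; 'new_tagged'
 * is the reallocated parent (new branch attached, child count already
 * incremented), tagged with its node type.
 */
static void ja_publish_realloc(void **slot, struct ja_inode *old_parent,
                               void *new_tagged)
{
        old_parent->gc = true;                  /* mark the old node as gc'd */
        rcu_assign_pointer(*slot, new_tagged);  /* readers switch to the new node */
        call_rcu(&old_parent->rcu_head, ja_free_node_rcu);
}
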
- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup the leaf to remove. If the leaf is missing, fail and release
the rcu read lock.

B) Lock

RCU-lookup the parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup the parent-parent and take the parent-parent lock.
Do so recursively until no node reallocation is needed, or until the root
is reached.

For each lock taken, check if the node is being gc'd. If gc'd, release
the lock, re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of a branch) covered by the taken locks necessarily
has a simple node removal or node update to perform on its topmost node.

If that operation is a node removal, then the entire branch under the
removed node simply disappears. No node allocation is needed.

Else, if the operation is a child node reallocation, the child node
itself necessarily performs a node removal, so _its_ entire child branch
disappears. Reallocate this child node without the removed branch
(remember to decrement its nr-children count).

D) Populate

No-reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its nr-children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set the old nodes' gc flag.
call_rcu free for all old nodes.

E) Locks

Release all locks.
rcu_read_unlock()

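The no-reallocation branch of step D is a single pointer store (sketch;
the reallocation branch follows the same publish pattern as in the
insertion sketch above):

#include <stdint.h>
#include <urcu.h>       /* rcu_assign_pointer() */

/*
 * Step D, no-reallocation case: clear the child slot in the topmost
 * locked node and decrement its child count.
 */
static void ja_remove_child(void **slot, uint8_t *nr_children)
{
        rcu_assign_pointer(*slot, NULL);
        (*nr_children)--;
}
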
For the various types of nodes:

- sequential search (type A)
  - RCU replacement: mutex
  - Entry update: mutex

- bitmap followed by pointer array (type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon-hole array (type C)
  - RCU replacement: mutex
  - Entry update: mutex