RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using Judy Array (compressed nodes) for RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than, e.g., non-RCU
  red-black trees.

Choice: use 256-entry intermediate nodes (the index at each level can be
represented on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit.
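
As an illustration of the 8-bits-per-level choice, here is a minimal
sketch of how a key could be split into per-level indexes. The function
names are hypothetical, and consuming the most-significant byte at the
root is an assumption; this document does not fix the byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Number of tree levels for a key of key_bits bits (8 bits per level). */
static inline unsigned int ja_nr_levels(unsigned int key_bits)
{
	assert(key_bits % 8 == 0 && key_bits <= 64);
	return key_bits / 8;
}

/*
 * Index (0..255) used at a given level for a key.
 * Assumption: level 0 is the root and consumes the most-significant
 * byte of the key.
 */
static inline uint8_t ja_level_index(uint64_t key, unsigned int key_bits,
				     unsigned int level)
{
	assert(level < ja_nr_levels(key_bits));
	return (uint8_t) (key >> (key_bits - 8 * (level + 1)));
}
```

With a 32-bit key 0x12345678, the four levels would use indexes 0x12,
0x34, 0x56 and 0x78 from root to leaf.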


* Node types (from less dense node to most dense)


- empty node:

The pointer to this node in its parent is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal only needs to update a single entry of the value and
  pointer arrays (non-RCU...). For RCU, we might need to update the
  entire node anyway.
- A lookup failure requires a sequential search through the entire
  value array.

Filled at 3 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail&success: 1 cache-line.

Filled at 6 entries max 32-bit, 7 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail&success: 1 cache-line.

Filled at 12 entries max 32-bit, 14 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max 32-bit, 28 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in values array fits in a 32-byte cache line.
  - lookup failure: 1 cache line.
  - lookup success: 2 cache lines.

The two below are listed for completeness' sake, but because they
require two 32-byte cache lines for lookup, they are deemed
inappropriate.

Filled at 51 entries max 32-bit, 56 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max 32-bit, 113 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes
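
A minimal sketch of one Type A size class (6 entries on 32-bit, 7
entries on 64-bit) and its sequential lookup follows. The struct and
function names are hypothetical; only the layout (count byte, value
array, pointer array) comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entries in this size class: 7 on 64-bit, 6 on 32-bit. */
#define JA_LINEAR_MAX	(sizeof(void *) == 8 ? 7 : 6)

struct ja_linear_node {
	uint8_t nr_child;			/* number of children */
	uint8_t values[JA_LINEAR_MAX];		/* 8-bit index values */
	void *ptrs[JA_LINEAR_MAX];		/* associated pointers */
};

/* 64-bit: 1 + 7 + 7*8 = 64 bytes; 32-bit: 1 + 6 + 1 pad + 6*4 = 32 bytes. */
_Static_assert(sizeof(struct ja_linear_node) ==
	       (sizeof(void *) == 8 ? 64 : 32),
	       "unexpected linear node size");

/* Sequential search: returns the child pointer, or NULL on lookup failure. */
static void *ja_linear_lookup(const struct ja_linear_node *node, uint8_t index)
{
	uint8_t i;

	assert(node->nr_child <= JA_LINEAR_MAX);
	for (i = 0; i < node->nr_child; i++) {
		if (node->values[i] == index)
			return node->ptrs[i];
	}
	return NULL;	/* the whole value array was scanned */
}
```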


- Type B: bitmap, followed by pointers array.

A bitmask (256 entries -> 256 bits -> 32 bytes) of populated children,
followed by an array of children pointers, in the same order as they
appear in the bitmask.

+ Lookup failure only touches the 32-byte bitmap. (1 cache line)
+ Lookup success touches the 32-byte bitmap, followed by a direct
  lookup into the pointer array. (2 cache lines)

Filled at 8 entries 32-bit, 12 entries 64-bit
32-bit: 32 + (4*8) ->  64 bytes
64-bit: 32 + (8*12) ->  128 bytes

Filled at 24 entries 32-bit, 28 entries 64-bit
32-bit: 32 + (4*24) ->  128 bytes
64-bit: 32 + (8*28) ->  256 bytes

Filled at 56 entries 32-bit, 60 entries 64-bit
32-bit: 32 + (4*56) ->  256 bytes
64-bit: 32 + (8*60) ->  512 bytes

Filled at 120 entries 32-bit, 124 entries 64-bit
32-bit: 32 + (4*120) ->  512 bytes
64-bit: 32 + (8*124) ->  1024 bytes
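
A minimal sketch of a Type B lookup follows. It assumes the position of
a child in the pointer array is the number of set bits below its index
in the bitmap, which is one way to read "in same order as appears in
the bitmask". All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct ja_bitmap_node {
	uint8_t bitmap[32];	/* bit i set <=> child i is populated */
	void *ptrs[];		/* one pointer per set bit, in bit order */
};

/* Allocate a zeroed node with room for nr_ptrs children pointers. */
static struct ja_bitmap_node *ja_bitmap_alloc(size_t nr_ptrs)
{
	return calloc(1, sizeof(struct ja_bitmap_node)
			+ nr_ptrs * sizeof(void *));
}

/* Count the bits set in one byte. */
static unsigned int ja_popcount8(uint8_t v)
{
	unsigned int count = 0;

	for (; v; v = (uint8_t) (v & (v - 1)))
		count++;
	return count;
}

static void *ja_bitmap_lookup(const struct ja_bitmap_node *node, uint8_t index)
{
	unsigned int byte = index / 8, bit = index % 8, rank = 0, i;

	assert(node != NULL);
	if (!(node->bitmap[byte] & (1U << bit)))
		return NULL;	/* lookup failure: only the bitmap was read */
	/* Rank of the child: number of populated children below index. */
	for (i = 0; i < byte; i++)
		rank += ja_popcount8(node->bitmap[i]);
	rank += ja_popcount8((uint8_t) (node->bitmap[byte] & ((1U << bit) - 1)));
	return node->ptrs[rank];
}
```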


- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121 entries+, 64-bit: 125 entries+)
Array of children node pointers. Pointers NULL if no child at index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes
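
For comparison with the two previous node types, a Type C lookup is a
single array access. A minimal sketch, with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ja_pigeon_node {
	void *child[256];	/* NULL if no child at this index */
};

/* 32-bit: 4*256 = 1024 bytes; 64-bit: 8*256 = 2048 bytes. */
_Static_assert(sizeof(struct ja_pigeon_node) == 256 * sizeof(void *),
	       "unexpected pigeon-hole node size");

/* Direct indexing: one cache line touched for both success and failure. */
static void *ja_pigeon_lookup(const struct ja_pigeon_node *node, uint8_t index)
{
	assert(node != NULL);
	return node->child[index];
}
```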


* Analysis of the thresholds:

We analyze the number of cache lines touched for each node, per node
type, as the number of children per node grows from 0 to 256. From this
analysis, we choose the child-count thresholds at which it is worthwhile
to switch to a different node type.

Per node:

- ALWAYS 1 cache line hit for lookup failure (all cases)

32-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache line hit for lookup success
- 32 bytes storage

up to 6 children

- 2 cache line hit for lookup success
- 64 bytes storage

up to 12 children

- Type B: bitmap, followed by pointers array.
- 2 cache line hit for lookup success
- 128 bytes storage

up to 24 children

- 256 bytes storage

up to 56 children

- 512 bytes storage

up to 120 children

- Type C: pigeon-hole array
- 1 cache line hit for lookup success
- 1024 bytes storage

up to 256 children


64-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache line hit for lookup success
- 32 bytes storage

up to 3 children

- 2 cache line hit for lookup success
- 64 bytes storage

up to 7 children

- 128 bytes storage

up to 14 children

- Type B: bitmap, followed by pointers array.
- 2 cache line hit for lookup success
- 256 bytes storage

up to 28 children

- 512 bytes storage

up to 60 children

- 1024 bytes storage

up to 124 children

- Type C: pigeon-hole array
- 1 cache line hit for lookup success
- 2048 bytes storage

up to 256 children
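
The thresholds above can be expressed as two small tables. The sketch
below picks the node type and storage size for a given number of
children; the numbers are those listed in this section, while the type
and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum ja_kind { JA_LINEAR, JA_BITMAP, JA_PIGEON };

struct ja_class {
	enum ja_kind kind;
	unsigned int max_child;	/* largest child count for this class */
	unsigned int bytes;	/* storage size of the node */
};

static const struct ja_class ja_classes_32[] = {
	{ JA_LINEAR, 6, 32 }, { JA_LINEAR, 12, 64 },
	{ JA_BITMAP, 24, 128 }, { JA_BITMAP, 56, 256 },
	{ JA_BITMAP, 120, 512 }, { JA_PIGEON, 256, 1024 },
};

static const struct ja_class ja_classes_64[] = {
	{ JA_LINEAR, 3, 32 }, { JA_LINEAR, 7, 64 }, { JA_LINEAR, 14, 128 },
	{ JA_BITMAP, 28, 256 }, { JA_BITMAP, 60, 512 },
	{ JA_BITMAP, 124, 1024 }, { JA_PIGEON, 256, 2048 },
};

/* Returns the smallest class able to hold nr_child, NULL for 0 children. */
static const struct ja_class *ja_pick_class(unsigned int nr_child, int is_64bit)
{
	const struct ja_class *table = is_64bit ? ja_classes_64 : ja_classes_32;
	size_t nr = is_64bit
		? sizeof(ja_classes_64) / sizeof(ja_classes_64[0])
		: sizeof(ja_classes_32) / sizeof(ja_classes_32[0]);
	size_t i;

	assert(nr_child <= 256);
	if (nr_child == 0)
		return NULL;	/* nonexistent node */
	for (i = 0; i < nr; i++) {
		if (nr_child <= table[i].max_child)
			return &table[i];
	}
	return NULL;	/* not reached: the last class holds 256 children */
}
```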


* Analysis of node type encoding and node pointers:

Lookups are _always_ from the top of the tree going down. This
facilitates RCU replacement as we only keep track of pointers going
downward.

The type of a node is encoded in the parent's pointer to it. This
requires reserving the 2 least-significant bits of the pointer.

Types of children:

enum child_type {
	LINEAR = 0,	/* Type A */
			/* 32-bit: 1 to 12 children, 32 to 64 bytes */
			/* 64-bit: 1 to 14 children, 32 to 128 bytes */
	BITMAP = 1,	/* Type B */
			/* 32-bit: 13 to 120 children, 128 to 512 bytes */
			/* 64-bit: 15 to 124 children, 256 to 1024 bytes */
	PIGEON = 2,	/* Type C */
			/* 32-bit: 121 to 256 children, 1024 bytes */
			/* 64-bit: 125 to 256 children, 2048 bytes */
	LEAF   = 3,	/* Leaf node */
};

If the entire pointer is NULL, the child is empty.
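
A minimal sketch of how the type could be packed into, and extracted
from, the 2 least-significant bits of a node pointer. The helper names
are hypothetical, and the sketch assumes nodes are aligned on at least
4 bytes so that those bits are free.

```c
#include <assert.h>
#include <stdint.h>

enum child_type {
	LINEAR = 0,	/* Type A */
	BITMAP = 1,	/* Type B */
	PIGEON = 2,	/* Type C */
	LEAF   = 3,	/* Leaf node */
};

#define JA_TYPE_MASK	((uintptr_t) 3)

/* Combine a node address and its type into one tagged pointer. */
static inline void *ja_tag(void *node, enum child_type type)
{
	/* The 2 low bits must be free: node aligned on at least 4 bytes. */
	assert(((uintptr_t) node & JA_TYPE_MASK) == 0);
	return (void *) ((uintptr_t) node | (uintptr_t) type);
}

/* Extract the node type from a tagged pointer. */
static inline enum child_type ja_type(void *tagged)
{
	return (enum child_type) ((uintptr_t) tagged & JA_TYPE_MASK);
}

/* Recover the node address from a tagged pointer. */
static inline void *ja_untag(void *tagged)
{
	return (void *) ((uintptr_t) tagged & ~JA_TYPE_MASK);
}
```

An empty child must be tested by comparing the whole tagged pointer
against NULL, before extracting the type.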


* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to manage
update concurrency. It's certainly not optimal in terms of concurrency
management within a node, but it has the advantage of being simple to
implement and understand.

We need to keep a count of the number of children nodes (for each node),
to keep track of when the node type thresholds are reached. It would be
important to add hysteresis so we don't change between node types too
often when the same node is repeatedly added and removed in a loop.
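
One possible way to express such a hysteresis is sketched below. The
margin parameter and all names are hypothetical; this document does not
specify how large the hysteresis should be. A node grows as soon as it
overflows its class, but only shrinks once its child count is "margin"
entries below what the next smaller class can hold.

```c
#include <assert.h>

enum ja_resize { JA_KEEP, JA_GROW, JA_SHRINK };

/*
 * cur_max:     largest child count the current class can hold.
 * smaller_max: largest child count of the next smaller class
 *              (0 if the current class is already the smallest).
 * margin:      hypothetical hysteresis, in number of children.
 */
static enum ja_resize ja_resize_decision(unsigned int nr_child,
					 unsigned int cur_max,
					 unsigned int smaller_max,
					 unsigned int margin)
{
	assert(smaller_max < cur_max);
	if (nr_child > cur_max)
		return JA_GROW;
	if (smaller_max > 0 && nr_child + margin <= smaller_max)
		return JA_SHRINK;
	return JA_KEEP;
}
```

With cur_max = 12, smaller_max = 6 and margin = 2, a node oscillating
between 6 and 7 children keeps its class; it only shrinks once it is
down to 4 children.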

We acquire locks from child to parent, nested. We take all locks
required to perform a given update in the tree (but no more) to keep it
consistent with respect to number of children per node.

The check for a node being gc'd is always performed under the node
lock. If the node turns out to be gc'd, we simply need to release the
lock and look up the node again.
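
A minimal sketch of this lock-then-validate step, using a pthread mutex.
The struct layout and function name are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct ja_node {
	pthread_mutex_t lock;
	bool gc;	/* set under lock once the node has been replaced */
};

/*
 * Take the node lock and check the gc flag under it.
 * Returns true with the lock held if the node is still live.
 * Returns false with the lock released if the node was gc'd; the
 * caller is then expected to RCU-lookup the node again and retry.
 */
static bool ja_lock_if_live(struct ja_node *node)
{
	assert(node != NULL);
	pthread_mutex_lock(&node->lock);
	if (node->gc) {
		pthread_mutex_unlock(&node->lock);
		return false;
	}
	return true;
}
```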


- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree until we reach the leaf node. If a
level is not populated, fail.

rcu_read_unlock()
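
A minimal sketch of this walk, under simplifying assumptions: every
intermediate node is a pigeon-hole array, keys are 32-bit, and a C11
acquire load stands in for rcu_dereference(). The places where the RCU
read-side critical section would begin and end are marked in comments.
All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define JA_LEVELS	4	/* 32-bit keys, 8 bits per level */

struct ja_node {
	_Atomic(void *) child[256];	/* NULL if not populated */
};

static void *ja_lookup(struct ja_node *root, uint32_t key)
{
	void *cur = root;
	unsigned int level;

	assert(root != NULL);
	/* rcu_read_lock() would be called here. */
	for (level = 0; level < JA_LEVELS && cur; level++) {
		uint8_t idx = (uint8_t) (key >> (8 * (JA_LEVELS - 1 - level)));
		struct ja_node *node = cur;

		/* Stand-in for rcu_dereference(). */
		cur = atomic_load_explicit(&node->child[idx],
					   memory_order_acquire);
	}
	/*
	 * rcu_read_unlock() would be called here, once the caller is
	 * done using the leaf.
	 */
	return cur;	/* NULL if some level was not populated */
}
```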


- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup insert position. Find location in tree where nodes are
missing for this insertion. If leaf is already present, insert fails,
releasing the rcu read lock.  The insert location consists of a parent
node to which we want to attach a new node.

B) Lock

RCU-lookup parent node. Take the parent lock. If the parent needs to be
reallocated to make room for this insertion, RCU-lookup parent-parent
node and take the parent-parent lock.  For each lock taken, check if
node is being gc'd. If gc'd, release lock, re-RCU-lookup this node, and
retry.

C) Create

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting. 

D) Populate

- If parent node reallocation is needed:
  - Reallocate the parent node, adding the new branch to it, and
    increment its node count.
  - Set gc flag in old nodes.
  - call_rcu free for all old nodes.
  - Populate new parent node with rcu_assign_pointer.
- Else:
  - Increment parent node count.
  - Use rcu_assign_pointer to populate this new branch into the parent
    node.

E) Locks

Release parent and (if taken) parent-parent locks.
rcu_read_unlock()
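
The reallocation case of step D can be sketched as a copy, update and
publish sequence. This is a simplified illustration, not the actual
implementation: the node has a fixed hypothetical capacity rather than
changing size class, a C11 release store stands in for
rcu_assign_pointer(), and the old node is returned to the caller rather
than handed to call_rcu. The caller is assumed to hold the required
locks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define JA_MAX	12	/* hypothetical fixed capacity */

struct ja_linear {
	bool gc;		/* set once the node has been replaced */
	uint8_t nr_child;
	uint8_t values[JA_MAX];
	void *ptrs[JA_MAX];
};

/*
 * Replace the node published in *slot by a copy containing one more
 * child. Returns the old node, which must only be freed after a grace
 * period (call_rcu in real code), or NULL if allocation failed.
 */
static struct ja_linear *ja_linear_realloc_add(_Atomic(struct ja_linear *) *slot,
					       uint8_t index, void *child)
{
	struct ja_linear *old = atomic_load_explicit(slot, memory_order_relaxed);
	struct ja_linear *new_node;

	assert(old != NULL && !old->gc && old->nr_child < JA_MAX);
	new_node = malloc(sizeof(*new_node));
	if (!new_node)
		return NULL;
	memcpy(new_node, old, sizeof(*new_node));
	new_node->values[new_node->nr_child] = index;
	new_node->ptrs[new_node->nr_child] = child;
	new_node->nr_child++;		/* increment the node count */
	old->gc = true;			/* set gc flag in the old node */
	/* Stand-in for rcu_assign_pointer(): publish the new node. */
	atomic_store_explicit(slot, new_node, memory_order_release);
	return old;
}
```

Readers that fetched the old node before the store keep seeing a
consistent, unmodified node until the grace period ends.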


- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup leaf to remove. If leaf is missing, fail and release rcu
read lock.

B) Lock

RCU-lookup parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup parent-parent and take the parent-parent lock. Do
so recursively until no node reallocation is needed, or until root is
reached.

For each lock taken, check if node is being gc'd. If gc'd, release lock,
re-RCU-lookup this node, and retry.
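
The number of locks needed for a removal can be worked out from the
child counts along the path. The sketch below is one reading of the
rule above: a node left with no children disappears, so its parent
loses a child too; a node that is reallocated only requires its parent
to swap one pointer. The shrink_at threshold and all names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* path[0] is the leaf's parent, path[depth - 1] is the root. */
struct ja_path_entry {
	unsigned int nr_child;	/* current number of children */
	unsigned int shrink_at;	/* hypothetical: reallocate to a smaller
				 * node when the count drops to this */
};

/* Number of locks to take, from the leaf's parent going up. */
static size_t ja_removal_nr_locks(const struct ja_path_entry *path,
				  size_t depth)
{
	size_t i;

	assert(depth > 0);
	for (i = 0; i < depth; i++) {
		unsigned int remaining;

		assert(path[i].nr_child > 0);
		remaining = path[i].nr_child - 1;
		if (remaining == 0 && i + 1 < depth)
			continue;	/* node vanishes: parent loses a child */
		if (remaining > 0 && remaining <= path[i].shrink_at
				&& i + 1 < depth)
			return i + 2;	/* reallocated: also lock its parent */
		return i + 1;		/* in-place update, or root reached */
	}
	return depth;	/* not reached: the last iteration always returns */
}
```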

C) Create

The branch (or portion of branch) covered by the taken locks necessarily
has either a simple node removal or a node update as the operation to
perform on its top node.

If the operation is a node removal, then, necessarily, the entire branch
under the node removal operation will simply disappear. No node
allocation is needed.

Else, if the operation is a child node reallocation, the child node will
necessarily do a node removal. So _its_ entire child branch will
disappear. So reallocate this child node without the removed branch
(remember to decrement its nr children count).

D) Populate

No reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its nr children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node. Set the gc flag in the old nodes, then
call_rcu free for all old nodes.

E) Locks

Release all locks.
rcu_read_unlock()


For the various types of nodes:

- sequential search (type A)
  - RCU replacement: mutex
  - Entry update: mutex

- bitmap followed by pointer array (type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon hole array (type C)
  - RCU replacement: mutex
  - Entry update: mutex