wfcqueue: implement concurrency-efficient queue
This new API simplify the wfqueue implementation, and brings a 2.3x to
2.6x performance boost due to the ability to eliminate false-sharing
between enqueue and dequeue.
This work is derived from the patch from Lai Jiangshan submitted as
"urcu: new wfqueue implementation"
(http://lists.lttng.org/pipermail/lttng-dev/2012-August/018379.html)
Its changelog:
> Some guys would be surprised by this fact:
> There are already TWO implementations of wfqueue in urcu.
>
> The first one is in urcu/static/wfqueue.h:
> 1) enqueue: exchange the tail and then update previous->next
> 2) dequeue: wait for first node's next pointer and them shift, a dummy node
> is introduced to avoid the queue->tail become NULL when shift.
>
> The second one shares some code with the first one, and the left code
> are spreading in urcu-call-rcu-impl.h:
> 1) enqueue: share with the first one
> 2) no dequeue operation: and no shift, so it don't need dummy node,
> Although the dummy node is queued when initialization, but it is removed
> after the first dequeue_all operation in call_rcu_thread().
> call_rcu_data_free() forgets to handle the dummy node if it is not removed.
> 3)dequeue_all: record the old head and tail, and queue->head become the special
> tail node.(atomic record the tail and change the tail).
>
> The second implementation's code are spreading, bad for review, and it is not
> tested by tests/test_urcu_wfq.
>
> So we need a better implementation avoid the dummy node dancing and can service
> both generic wfqueue APIs and dequeue_all API for call rcu.
>
> The new implementation:
> 1) enqueue: share with the first one/original implementation.
> 2) dequeue: shift when node count >= 2, cmpxchg when node count = 1.
> no dummy node, save memory.
> 3) dequeue_all: simply set queue->head.next to NULL, xchg the tail
> and return the old head.next.
>
> More implementation details are in the code.
> tests/test_urcu_wfq will be update in future for testing new APIs.
The patch proposed by Lai brings a very interesting simplification to
the single-node handling (which is kept here), and moves all queue
handling code away from call_rcu implementation, back into the wfqueue
code. This has the benefit to allow testing enhancements.
I modified it so the API does not expose implementation details to the
user (e.g. ___cds_wfq_node_sync_next). I added a "splice" operation and
a for loop iterator which should allow wfqueue users to use the list
very efficiently both from LGPL/GPL code and from non-LGPL-compatible
code.
I also changed the API so the queue head and tail are now two separate
structures: it allows the queue user to place these as they like, either
on different cache lines (to eliminate false-sharing), or close one to
another (on same cache-line) in case a queue is spliced onto the stack
and not concurrently accessed.
Benchmarks performed on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
(dual-core, with hyperthreading)
Benchmark invoked:
for a in $(seq 1 10); do ./test_urcu_wfq 1 1 10 -a 0 -a 2; done
(using cpu number 0 and 2, which should correspond to two cores of my
Intel 2-core/hyperthread processor)
Before patch:
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
97274297 nr_dequeues
80745742 successful enqueues
97274297 successful dequeues
80745321 end_dequeues
16528976 nr_ops
178020039
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
92300568 nr_dequeues
75019529 successful enqueues
92300568 successful dequeues
74973237 end_dequeues
17327331 nr_ops
167320097
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
93516443 nr_dequeues
75846726 successful enqueues
93516443 successful dequeues
75826578 end_dequeues
17689865 nr_ops
169363169
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94160362 nr_dequeues
77967638 successful enqueues
94160362 successful dequeues
77967638 end_dequeues
16192724 nr_ops
172128000
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
97491956 nr_dequeues
81001191 successful enqueues
97491956 successful dequeues
81000247 end_dequeues
16491709 nr_ops
178493147
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94101298 nr_dequeues
75650510 successful enqueues
94101298 successful dequeues
75649318 end_dequeues
18451980 nr_ops
169751808
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94742803 nr_dequeues
75402105 successful enqueues
94742803 successful dequeues
75341859 end_dequeues
19400944 nr_ops
170144908
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
92198835 nr_dequeues
75037877 successful enqueues
92198835 successful dequeues
75027605 end_dequeues
17171230 nr_ops
167236712
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94159560 nr_dequeues
77895972 successful enqueues
94159560 successful dequeues
77858442 end_dequeues
16301118 nr_ops
172055532
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
96059399 nr_dequeues
80115442 successful enqueues
96059399 successful dequeues
80066843 end_dequeues
15992556 nr_ops
176174841
After patch:
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
221229322 nr_dequeues
210645491 successful enqueues
221229322 successful dequeues
210645088 end_dequeues
10584234 nr_ops
431874813
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
219803943 nr_dequeues
210377337 successful enqueues
219803943 successful dequeues
210368680 end_dequeues
9435263 nr_ops
430181280
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
237006358 nr_dequeues
237035340 successful enqueues
237006358 successful dequeues
236997050 end_dequeues 9308 nr_ops
474041698
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
235822443 nr_dequeues
235815942 successful enqueues
235822443 successful dequeues
235814020 end_dequeues 8423 nr_ops
471638385
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
235825567 nr_dequeues
235811803 successful enqueues
235825567 successful dequeues
235810526 end_dequeues 15041 nr_ops
471637370
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
221974953 nr_dequeues
210938190 successful enqueues
221974953 successful dequeues
210938190 end_dequeues
11036763 nr_ops
432913143
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
237994492 nr_dequeues
237938119 successful enqueues
237994492 successful dequeues
237930648 end_dequeues 63844 nr_ops
475932611
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
220634365 nr_dequeues
210491382 successful enqueues
220634365 successful dequeues
210490995 end_dequeues
10143370 nr_ops
431125747
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
237388065 nr_dequeues
237401251 successful enqueues
237388065 successful dequeues
237380295 end_dequeues 7770 nr_ops
474789316
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
221201436 nr_dequeues
210831162 successful enqueues
221201436 successful dequeues
210831162 end_dequeues
10370274 nr_ops
432032598
Summary: Both enqueue and dequeue speed increase: around 2.3x speedup
for enqueue, and around 2.6x for dequeue.
We can verify that:
successful enqueues - successful dequeues = end_dequeues
For all runs (ensures correctness: no lost node).
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>