From 03e5118f0fce1e54a0422cf55778c689dbcb5838 Mon Sep 17 00:00:00 2001
From: Mathieu Desnoyers
Date: Thu, 11 Oct 2012 12:28:23 -0400
Subject: [PATCH] Fix: call_rcu list corruption on teardown (documentation)

This commit is a place-holder to document that commit
5161f31e09ce33dd79afad8d08a2372fbf1c4fbe fixed a list corruption bug in
call_rcu. Introducing __cds_wfcq_splice_blocking() fixed a list
corruption bug in the 0.7.x series. The equivalent fix appeared in 0.6.8
for the stable-0.6 branch.

Description of the bug:

* Mathieu Desnoyers wrote:
> * Lai Jiangshan (laijs@cn.fujitsu.com) wrote:
> > test code:
> > ./tests/test_urcu_lfs 100 10 10
> >
> > bug produce rate > 60%
> >
> > {{{
> > I didn't see any bug when "./tests/test_urcu_lfs 10 10 10" Or
> > "./tests/test_urcu_lfs 100 100 10"
> > But I just test it about 5 times
> > }}}
> >
> > 4cores*1threads: Intel(R) Core(TM) i5 CPU 760
> > RCU_MB (no time to test for other rcu type)
> > test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
> >
> > I didn't see any bug when "./tests/test_urcu_mb 10 100 10"
> >
> > Sorry, I tried, but I failed to find out the root cause currently.
>
> I think I managed to narrow down the issue:
>
> 1) the master branch does not reproduce it, but commit
>    768fba83676f49eb73fd1d8ad452016a84c5ec2a reproduces it about 50% of
>    the time.
>
> 2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
>    current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
>    moving to wfcqueue.
>
> 3) the bug always arises, for me, at the end of the 10 seconds.
>    However, it might be simply due to the fact that most of the memory
>    gets freed at the end of program execution.
>
> 4) I've been able to get a backtrace, and it looks like we have some
>    call_rcu callback-invocation threads still working while
>    call_rcu_data_free() is invoked. In the backtrace,
>    call_rcu_data_free() is nicely waiting for the next thread to stop,
>    and during that time, two callback-invocation threads are invoking
>    callbacks (and one of them triggers the segfault).
>
> So I expect that commit
>
> commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
> Author: Mathieu Desnoyers
> Date:   Tue Sep 25 10:50:49 2012 -0500
>
>     call_rcu: use wfcqueue, eliminate false-sharing
>
>     Eliminate false-sharing between call_rcu (enqueuer) and worker threads
>     on the queue head and tail.
>
>     Acked-by: Paul E. McKenney
>     Signed-off-by: Mathieu Desnoyers
>
> could have managed to fix the issue, or changed the timing enough that
> it no longer reproduces. I'll continue investigating.

The bug was in call_rcu. No separate fix is required for master, because
we fixed it while moving to wfcqueue. We were erroneously writing to the
head field of the default call_rcu_data rather than tail.

The conditions to reproduce this bug:

1) set up per-cpu callback-invocation threads,
2) use call_rcu,
3) call call_rcu_data_free() while there are still some pending callbacks
   that have not yet been executed by the callback-invocation threads,
4) we then get corruption due to the "default" callback invocation
   thread walking through a corrupted queue.

Signed-off-by: Mathieu Desnoyers
---
 urcu-call-rcu-impl.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
index 4e5879f..61aeee6 100644
--- a/urcu-call-rcu-impl.h
+++ b/urcu-call-rcu-impl.h
@@ -626,6 +626,10 @@ void call_rcu(struct rcu_head *head,
  * The caller must wait for a grace-period to pass between return from
  * set_cpu_call_rcu_data() and call to call_rcu_data_free() passing the
  * previous call rcu data as argument.
+ *
+ * Note: introducing __cds_wfcq_splice_blocking() in this function fixed
+ * a list corruption bug in the 0.7.x series. The equivalent fix
+ * appeared in 0.6.8 for the stable-0.6 branch.
  */
 void call_rcu_data_free(struct call_rcu_data *crdp)
 {
-- 
2.34.1
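
Illustrative note (not part of the patch): the commit message says the
pending callbacks were moved to the default call_rcu_data by writing to
its head field rather than its tail. The sketch below is a simplified,
single-threaded model of why that class of mistake damages a queue. It is
not the liburcu code: struct node, struct queue and the queue_*()
helpers are hypothetical names invented for this example, and the buggy
variant is only an assumed approximation of the original error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not liburcu's wfqueue/wfcqueue. */
struct node {
	struct node *next;
};

struct queue {
	struct node *head;	/* first node, NULL when the queue is empty */
	struct node **tail;	/* address of the last "next" pointer */
};

static void queue_init(struct queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

static void queue_enqueue(struct queue *q, struct node *n)
{
	assert(n != NULL);
	n->next = NULL;
	*q->tail = n;		/* link after the current last node */
	q->tail = &n->next;
}

/* Correct splice: link src's nodes after dest's current tail. */
static void queue_splice(struct queue *dest, struct queue *src)
{
	if (src->head == NULL)
		return;
	*dest->tail = src->head;
	dest->tail = src->tail;
	queue_init(src);
}

/*
 * Buggy splice (assumed shape of the error): the store goes to dest's
 * head instead of through its tail, so nodes already pending in dest
 * are no longer reachable from the queue.
 */
static void queue_splice_buggy(struct queue *dest, struct queue *src)
{
	if (src->head == NULL)
		return;
	dest->head = src->head;
	dest->tail = src->tail;
	queue_init(src);
}

/* Count the nodes reachable from the queue head. */
static int queue_length(const struct queue *q)
{
	const struct node *n;
	int len = 0;

	for (n = q->head; n != NULL; n = n->next)
		len++;
	return len;
}
```

In this model, splicing a 3-node queue into a 2-node queue yields 5
reachable nodes with queue_splice() but only 3 with
queue_splice_buggy(). In the concurrent setting described above, the
consequence would be worse than lost entries, since worker threads may
still be walking the overwritten list.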