urcu tests: hold mutex across use of custom allocator
A thread preempted for a long period of time could race, when scheduled
again, with another thread that would have been allocating/freeing
entries (thus wrapping-around the available buffer), which would trigger
this race only when overcommitting the number of threads compared to the
number of available CPUs.
Take the mutex across alloc and free to fix this.
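A minimal sketch of the fix, assuming a hypothetical wrap-around pool
(pool, pool_alloc, pool_free and update_once are illustrative names, not
the test suite's own). The lock is held from allocation through free, so
a thread that was preempted cannot wake up holding an entry that another
thread has since recycled:

```c
#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 4	/* hypothetical pool size */

struct entry {
	int a;
};

static struct entry pool[POOL_SIZE];
static size_t next_slot;	/* wraps around the pool */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold pool_lock. */
static struct entry *pool_alloc(void)
{
	struct entry *e = &pool[next_slot];

	next_slot = (next_slot + 1) % POOL_SIZE;
	return e;
}

/* Caller must hold pool_lock. */
static void pool_free(struct entry *e)
{
	e->a = 0;
}

/*
 * One update: the lock covers alloc, use and free, rather than being
 * dropped between the allocator calls. Returns the value read back.
 */
static int update_once(int value)
{
	struct entry *e;
	int seen;

	pthread_mutex_lock(&pool_lock);
	e = pool_alloc();
	e->a = value;
	seen = e->a;
	pool_free(e);
	pthread_mutex_unlock(&pool_lock);
	return seen;
}
```

The cost is that updaters are fully serialized, which is acceptable for
a test program but would not necessarily be for production code.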
mremap keeps the same virtual pages for the old/new mappings. So
explicitly copying from the old mapping is not needed, and probably
buggy, since the old mapping might have been unmapped.
call_rcu: per_cpu_call_rcu_data should be non-const
On FreeBSD:
In file included from urcu.c:438:
urcu-call-rcu-impl.h: In function 'get_cpu_call_rcu_data_mb':
urcu-call-rcu-impl.h:325: warning: return discards qualifiers from pointer target type
compat_arch_x86.c: In function '_compat_uatomic_set':
compat_arch_x86.c:104:16: warning: variable 'result' set but not used [-Wunused-but-set-variable]
* Incorrect prototype for uatomic_and and uatomic_or in i386
compatibility code.
* Missing $(COMPAT) code inclusion in wfq/lfq tests.
* Silence gcc warnings about compat code (branch voluntarily causing a
linker error, which can never return).
./urcu/uatomic/generic.h: In function '_uatomic_and':
./urcu/uatomic/generic.h:310:2: warning: 'return' with a value, in
function returning void
./urcu/uatomic/generic.h: In function '_uatomic_or':
./urcu/uatomic/generic.h:374:2: warning: 'return' with a value, in
function returning void
Even though int is 32-bit on all architectures supported by liburcu so
far, make it future-proof by using an int32_t, which enforces the same
type width used by the system call in the kernel.
Use int32_t rather than uint32_t to make comparison with 0 more
straightforward.
Reported-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
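A small sketch of why the signed type is convenient, assuming a
hypothetical wait-state word where 0 means "nobody waiting" and -1 means
"a waiter is asleep" (wait_state and waiter_present are illustrative
names only):

```c
#include <stdint.h>

/* Hypothetical wait-state word: 0 = nobody waiting, -1 = waiter asleep. */
static int32_t wait_state;

/* The kernel operates on a 32-bit word, so pin the width at compile time. */
_Static_assert(sizeof(wait_state) == 4, "futex words are 32 bits wide");

/* With uint32_t this comparison would always be false. */
static int waiter_present(const int32_t *state)
{
	return *state < 0;
}
```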
Yannick Brosseau [Fri, 10 Jun 2011 15:35:49 +0000 (11:35 -0400)]
Add library version information
Following the guidelines from libtool
(http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.htm)
this patch adds version information to the distributed libraries.
For the next release, the version will be 1:0:0.
It will need to be updated before each release.
Paolo Bonzini [Thu, 9 Jun 2011 16:54:38 +0000 (12:54 -0400)]
arm: remove useless declarations
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
commit bc94ca9bada25f7403e3e859caa241146ae8e338 changed the !RT behavior
slightly: when the list is not empty, it does not wait for a delay
anymore. Add this delay back, to ensure we don't flood the system with
frequent synchronize_rcu() calls, which would slow down readers.
The wait scheme has an implementation problem: if the list is not empty
when the !RT scheme checks for it, it will restart the loop and
decrement the futex (again) without calling call_rcu_wait() (which would
wait until it is set back to 0). So in this case, we can end up
decrementing "futex" to values well below -1.
Fix this by moving the decrement before the loop, and duplicating it after
the return from call_rcu_wait() + poll() delay. Also move the "set futex to
0 upon stopping" outside of the loop: this is the only way the loop can
be stopped anyway.
Paolo Bonzini [Thu, 9 Jun 2011 14:13:13 +0000 (10:13 -0400)]
use generic-size macros for common implementation of atomic ops
The definition of _uatomic_cmpxchg is different in x86 and other
architectures. For x86 it is a 4-argument macro, for other
architectures it is a 3-argument function. This patch makes it easier
to implement atomic operations incrementally (first as a generic version
and then in machine-specific code), which aids testing and
bisectability.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
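As an illustration of the general idea, here is a sketch of a
size-dispatching compare-and-swap built on the GCC __sync builtins. The
names my_cmpxchg and my_cmpxchg_sized are hypothetical and this is not
the library's actual code; it only shows how one macro can forward
sizeof(*(addr)) to a common function, so that an architecture can later
override just the inner function:

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Generic version: dispatch on operand size, use compiler builtins. */
static inline unsigned long my_cmpxchg_sized(void *addr, unsigned long old,
					     unsigned long new_val, int len)
{
	switch (len) {
	case 4:
		return __sync_val_compare_and_swap((uint32_t *) addr,
				(uint32_t) old, (uint32_t) new_val);
#if ULONG_MAX > 0xffffffffUL
	case 8:
		return __sync_val_compare_and_swap((uint64_t *) addr,
				(uint64_t) old, (uint64_t) new_val);
#endif
	}
	abort();	/* unsupported operand size */
}

/* Front-end macro: same call shape regardless of the operand type. */
#define my_cmpxchg(addr, old, new_val)					\
	((__typeof__(*(addr))) my_cmpxchg_sized((addr),		\
			(unsigned long) (old),				\
			(unsigned long) (new_val),			\
			sizeof(*(addr))))
```

The macro returns the value that was found at addr, so the swap took
place only if the result equals old.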
Paolo Bonzini [Thu, 9 Jun 2011 13:32:58 +0000 (09:32 -0400)]
call_rcu: drop mutex
The mutex is being used only to protect OR accesses to the flags.
Just use atomic operations for that.
[ Edit: this also fixes busy-looping on flags that were previously read
without volatile access, which could lead to a never-ending loop given the
appropriate set of compiler optimisations. ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
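A sketch of the same technique using C11 atomics instead of the
library's own uatomic primitives (a substitution made for the sake of a
self-contained example). FLAG_STOP, FLAG_STOPPED, set_flag and test_flag
are hypothetical names:

```c
#include <stdatomic.h>

#define FLAG_STOP	(1U << 0)	/* hypothetical flag bits */
#define FLAG_STOPPED	(1U << 1)

static atomic_uint flags;

/* Atomic read-modify-write: no mutex needed to OR a bit in. */
static void set_flag(atomic_uint *f, unsigned int bit)
{
	atomic_fetch_or(f, bit);
}

/* Atomic load: the compiler cannot hoist it out of a polling loop. */
static int test_flag(atomic_uint *f, unsigned int bit)
{
	return (atomic_load(f) & bit) != 0;
}
```

A polling loop such as "while (!test_flag(&flags, FLAG_STOPPED))" then
rereads the flags on every iteration.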
/usr/local/lib/liburcu-bp.so: undefined reference to `cds_wfq_enqueue'
/usr/local/lib/liburcu-bp.so: undefined reference to `cds_wfq_init'
/usr/local/lib/liburcu-bp.so: undefined reference to `cds_wfq_node_init'
Paolo Bonzini [Wed, 8 Jun 2011 22:09:19 +0000 (18:09 -0400)]
rcutorture: make goflag volatile
Even cmm_barrier may not be enough as a compiler barrier in the presence
of static variables, and cmm_mb would not be either! The compiler
assumes that calling a function does not clobber static variables if it
can prove that they do not escape (which also implies the correct phase
of the moon).
Rather than sprinkling the code with compiler barriers, I am taking the
easy way out and declaring the variable volatile. Without this patch,
rcutorture_qsbr never finishes for me (GCC 4.5.0).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
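A reduced sketch of the pattern, not the rcutorture code itself
(spin_reader, run_and_stop and the GOFLAG_* values are hypothetical
names). The volatile qualifier forces a fresh load of goflag on each
iteration, so the reader notices the store made by the other thread.
Note that volatile only constrains the compiler; C11 atomics would be
the portable choice for new code:

```c
#include <pthread.h>

enum { GOFLAG_RUN, GOFLAG_STOP };

/* volatile: every test of goflag below is a real memory load. */
static volatile int goflag = GOFLAG_RUN;

static void *spin_reader(void *arg)
{
	unsigned long *loops = arg;

	while (goflag == GOFLAG_RUN)
		(*loops)++;
	return NULL;
}

/* Start a reader, tell it to stop, wait for it. Returns 0 on success. */
static int run_and_stop(void)
{
	pthread_t tid;
	unsigned long loops = 0;

	goflag = GOFLAG_RUN;
	if (pthread_create(&tid, NULL, spin_reader, &loops))
		return -1;
	goflag = GOFLAG_STOP;
	return pthread_join(tid, NULL) ? -1 : 0;
}
```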
If I remove the URCU_CALL_RCU_RT flag from the rbtree single writer
test, thus using the pthread_cond_signal mechanism, there is a huge
slowdown: without cpu affinity for the worker threads, it crawls to 129
updates/s (looks like mutex contention between the thread calling
call_rcu and the call_rcu thread). Adding CPU affinity to the per-cpu
call_rcu threads, I get 546 updates/s, which is slightly better (better
cache locality, and maybe the mutex contention is not as bad thanks to
the two threads sharing the same CPU).
So I decided to try replacing pthread_cond_wait/signal with my
futex-based implementation I use for the rest of the urcu lib: it has
the advantage of removing the mutex from the call_rcu() execution
entirely, sampling the "futex" variable without any mutex whatsoever for
the case where no wakeup is needed.
Disabling URCU_CALL_RCU_RT flag, with per-cpu affined call_rcu threads,
with my futex-based wakeup implementation, I get 55754 updates/s (even
better than with URCU_CALL_RCU_RT flag!).
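The fast path described above can be sketched as follows, using C11
atomics and a counter standing in for the FUTEX_WAKE system call so that
the example stays self-contained. struct waitq, wake_up_worker and
worker_prepare_to_sleep are hypothetical names, and the real
implementation also needs memory barriers and the actual futex wait,
which are omitted here:

```c
#include <stdatomic.h>
#include <stdint.h>

struct waitq {
	_Atomic int32_t futex;	/* 0: worker running, -1: worker asleep */
	unsigned int wakeups;	/* stand-in for the FUTEX_WAKE syscall */
};

/* Called by the enqueuing thread after it has queued a callback. */
static void wake_up_worker(struct waitq *w)
{
	/* Fast path: one atomic load, no mutex, no system call. */
	if (atomic_load(&w->futex) != -1)
		return;
	atomic_store(&w->futex, 0);
	w->wakeups++;		/* a real version would issue FUTEX_WAKE here */
}

/* Called by the worker before it checks the queue and goes to sleep. */
static void worker_prepare_to_sleep(struct waitq *w)
{
	atomic_fetch_sub(&w->futex, 1);
}
```

When the worker is busy, which is the common case under load, the
enqueuing thread pays for a single load and nothing else.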
call_rcu: use cpu affinity for per-cpu call_rcu threads
I played a bit with the call_rcu() implementation alongside my
rbtree tests, and noticed the following:
If I use per-cpu call_rcu threads with URCU_CALL_RCU_RT flag, with one
updater thread only for my rbtree (no reader), I get 38365 updates/s.
If I add cpu affinity to these per-cpu call_rcu threads (I have prepared
a patch that does this), it jumps to 54219 updates/s. So it looks like
keeping per-cpu affinity for the call_rcu thread is a good thing.
Map symbols to allow multiple RCU flavors to be used in one binary
Probably need similar mapping for rcu_defer(). Definitely need
backwards-compatibility mapping for programs compiled against
old versions of the library.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Colin McCabe [Wed, 11 May 2011 11:38:29 +0000 (07:38 -0400)]
userspace-rcu tests: zero array before using
It seems like we need the test arrays to start out zeroed.
Without this patch, I get errors like this when running the tests:
test_urcu: test_urcu.c:201: test_array_alloc:
Assertion `test_array[index].a == 0xDEADBEEF || test_array[index].a ==
0' failed.
Signed-off-by: Colin McCabe <cmccabe@alumni.cmu.edu> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
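One way to guarantee the zeroed starting state, sketched with calloc().
The struct layout and the helper name alloc_test_array are assumptions
made for this example; the patch itself may zero the array differently
(for instance with memset()):

```c
#include <stdlib.h>

struct test_array {
	int a;
};

/* calloc() returns zero-filled memory, so every .a starts at 0. */
static struct test_array *alloc_test_array(size_t nr_entries)
{
	return calloc(nr_entries, sizeof(struct test_array));
}
```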
Jason Wessel [Wed, 4 May 2011 19:14:14 +0000 (15:14 -0400)]
arm: put CONFIG_RCU_ARM_HAVE_DMB in the global config.h
Change CONFIG_ARM_HAVE_DMB to CONFIG_RCU_ARM_HAVE_DMB while at the
same time making it a global setting in urcu/config.h.
The initial ARM < v7 work missed this setting, which allowed the URCU
package to build correctly, but exposed the memory barrier
defines incorrectly to other applications using the URCU headers.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Adds call_rcu(), with RCU threads to invoke the callbacks. By default,
there will be one such RCU thread per process, created the first time
that call_rcu() is invoked. On systems supporting sched_getcpu(), it
is possible to create one RCU thread per CPU by calling
create_all_cpu_call_rcu_data().
This version includes feedback from Mathieu Desnoyers.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Jason Wessel [Mon, 28 Feb 2011 16:34:10 +0000 (11:34 -0500)]
Add a test for the dmb instruction on ARM
The dmb instruction is used to implement a memory barrier on modern
ARM hardware. Older generations of processors, which are found in
single-processor systems, can get by with the generic implementation
of cmm_mb(). This allows userspace RCU to work on many other ARM
systems.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
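A sketch of how such a configure-time result might select the barrier.
The macro my_mb() and the publish() helper are hypothetical, and the
generic fallback is assumed here to be the compiler builtin
__sync_synchronize(); the library's actual headers may differ:

```c
/* Use dmb only where the configure test found it; otherwise fall back. */
#if defined(__arm__) && defined(CONFIG_RCU_ARM_HAVE_DMB)
#define my_mb()	__asm__ __volatile__ ("dmb" ::: "memory")
#else
#define my_mb()	__sync_synchronize()
#endif

static int data;
static int ready;

/* Make the data store visible before the ready flag is set. */
static void publish(int value)
{
	data = value;
	my_mb();
	ready = 1;
}
```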