Fix: abort on decrement_sem_count during concurrent tracing start and teardown
author    Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
          Mon, 6 Dec 2021 14:20:58 +0000 (09:20 -0500)
committer Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
          Thu, 9 Dec 2021 16:29:17 +0000 (11:29 -0500)
commit    4d03f5699ca5129319b8ceb7baef85bbf05f3f3e
tree      6d35b6ee509f8d78e8ba1994d0ed0e4f2dfeba96
parent    7850c5cc9c0ff172bff3798c55f3d17bebf7c6ce

Observed issue
==============

The following backtrace has been reported:

 #0  __GI_raise (sig=sig@entry=6)
     at /usr/src/debug/glibc/2.31/git/sysdeps/unix/sysv/linux/raise.c:50
 #1  0x0000007f90b3fdd4 in __GI_abort () at /usr/src/debug/glibc/2.31/git/stdlib/abort.c:79
 #2  0x0000007f90b4bf50 in __assert_fail_base (fmt=0x7f90c3da98 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
     assertion=assertion@entry=0x7f9112cb90 "uatomic_read(&sem_count) >= count",
     file=file@entry=0x7f9112cb30 "/usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c",
 line=line@entry=664, function=function@entry=0x7f911317e8 <__PRETTY_FUNCTION__.10404> "decrement_sem_count")
     at /usr/src/debug/glibc/2.31/git/assert/assert.c:92
 #3  0x0000007f90b4bfb4 in __GI___assert_fail (assertion=assertion@entry=0x7f9112cb90 "uatomic_read(&sem_count) >= count",
     file=file@entry=0x7f9112cb30 "/usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c",
 line=line@entry=664, function=function@entry=0x7f911317e8 <__PRETTY_FUNCTION__.10404> "decrement_sem_count")
     at /usr/src/debug/glibc/2.31/git/assert/assert.c:101
 #4  0x0000007f910e3830 in decrement_sem_count (count=<optimized out>)
     at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:664
 #5  0x0000007f910e5d28 in handle_pending_statedump (sock_info=0x7f9115c608 <global_apps>)
     at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:737
 #6  handle_message (lum=0x7f8dde46d8, sock=3, sock_info=0x7f9115c608 <global_apps>)
     at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:1410
 #7  ust_listener_thread (arg=0x7f9115c608 <global_apps>)
     at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:2055
 #8  0x0000007f90af73e0 in start_thread (arg=0x7fc27a82f6)
     at /usr/src/debug/glibc/2.31/git/nptl/pthread_create.c:477
 #9  0x0000007f90bead5c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
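
For reference, the assertion that fires lives in `decrement_sem_count`.
A minimal sketch of its shape, loosely based on the 2.13 sources
(details may differ slightly):

   static
   void decrement_sem_count(unsigned int count)
   {
           int ret;

           /* Fires when the semaphore count was already fully consumed. */
           assert(uatomic_read(&sem_count) >= count);

           if (uatomic_read(&sem_count) == count) {
                   ret = sem_post(&constructor_wait);
                   assert(!ret);
           }
           uatomic_add(&sem_count, -count);
   }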

It turns out that the main thread is, at that point, iterating over the
libraries' destructors:

Thread 3 (LWP 1983):
 #0  0x0000007f92a68a0c in _dl_fixup (l=0x7f9054e510, reloc_arg=432)
     at /usr/src/debug/glibc/2.31/git/elf/dl-runtime.c:69
 #1  0x0000007f92a6ea3c in _dl_runtime_resolve () at ../sysdeps/aarch64/dl-trampoline.S:100
 #2  0x0000007f905170f8 in __do_global_dtors_aux () from <....>/crash/work/rootfs/usr/lib/libbsd.so.0
 #3  0x0000007f92a697f8 in _dl_fini () at /usr/src/debug/glibc/2.31/git/elf/dl-fini.c:138
 #4  0x0000007f90b54864 in __run_exit_handlers (status=0, listp=0x7f90c65648 <__exit_funcs>,
     run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
     at /usr/src/debug/glibc/2.31/git/stdlib/exit.c:108
 #5  0x0000007f90b549f4 in __GI_exit (status=<optimized out>)
     at /usr/src/debug/glibc/2.31/git/stdlib/exit.c:139
 #6  0x0000000000404c98 in a_function_name (....) at main.c:152
 #7  0x0000000000404a98 in main (argc=3, argv=0x7fc27a8858, env=0x7fc27a8878) at main.c:97

Cause
=====

An enable command is processed at the same time that the lttng-ust
destructor is run. At the end of the command handling,
`handle_pending_statedump` is called. Multiple variables from the
`sock_info` struct are checked outside the UST lock at that point.

lttng-ust-comm.c +1406:
   /*
    * Performed delayed statedump operations outside of the UST
    * lock. We need to take the dynamic loader lock before we take
    * the UST lock internally within handle_pending_statedump().
    */
   handle_pending_statedump(sock_info);

Namely:
   registration_done
   statedump_pending
   initial_statedump_done
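
For context, a simplified sketch of `handle_pending_statedump` showing
how those fields are read without holding the UST lock (loosely based
on the 2.13 sources; error handling elided):

   static
   void handle_pending_statedump(struct sock_info *sock_info)
   {
           /* Both fields are read without holding the UST lock. */
           if (sock_info->registration_done && sock_info->statedump_pending) {
                   sock_info->statedump_pending = 0;
                   pthread_mutex_lock(&ust_fork_mutex);
                   lttng_handle_pending_statedump(sock_info);
                   pthread_mutex_unlock(&ust_fork_mutex);
           }

           if (!sock_info->initial_statedump_done) {
                   sock_info->initial_statedump_done = 1;
                   decrement_sem_count(1);
           }
   }

If the main thread resets `initial_statedump_done` to 0 after the
constructor semaphore has already been fully decremented, the listener
thread re-enters the second branch and `decrement_sem_count` trips the
`uatomic_read(&sem_count) >= count` assertion shown in the backtrace.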

`statedump_pending` is set during the enable command
(`lttng_session_statedump`, lttng-events.c +631) in the same thread.

As for `registration_done` and `initial_statedump_done`, they are
invariant from the registration of the app until `lttng_ust_cleanup` is
called. `cleanup_sock_info`, called by `lttng_ust_cleanup` (itself
called by `lttng_ust_exit`), resets the `registration_done` and
`initial_statedump_done` fields. Note that these operations are
performed outside of the listener thread.
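
As a sketch, the pre-fix reset in `cleanup_sock_info` looks roughly
like this (assuming the `(sock_info, exiting)` signature of the 2.13
sources; socket and wait-shm teardown elided):

   static
   void cleanup_sock_info(struct sock_info *sock_info, int exiting)
   {
           /* ... close sockets, unmap wait shm, etc. ... */

           /* Runs in the main thread, racing with the listener thread. */
           sock_info->registration_done = 0;
           sock_info->initial_statedump_done = 0;
   }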

Note that by that point `lttng_ust_exit` expects all "getters" on
`sock_info` to fail while trying to acquire the UST lock, since
`lttng_ust_comm_should_quit` is set to 1. The listener threads can
still exist at that point because we do not join them; we only call
`pthread_cancel`, which is asynchronous.
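
That "fail on the UST lock" expectation works roughly as sketched
below: `ust_lock` returns nonzero once `lttng_ust_comm_should_quit` is
set, and callers are expected to bail out (heavily simplified; the real
function also handles signal masking, cancellation state and lock
nesting):

   int ust_lock(void)
   {
           /* ... block signals, take ust_mutex ... */
           if (lttng_ust_comm_should_quit)
                   return 1;       /* caller must abort its operation */
           return 0;
   }

The problem is that the field checks in `handle_pending_statedump`
happen before the UST lock is taken, so this protection does not cover
them.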

Clearly, we are missing the mutual exclusion that locking would
normally provide when accessing `registration_done` and
`initial_statedump_done`.

Solution
========

Here we can do better and simply avoid requiring any lock-based mutual
exclusion in the first place.

`registration_done` and `initial_statedump_done` only need to be reset
to zero when we are not actually exiting, i.e. on the
`lttng_ust_after_fork_child` path. In that case, no concurrent listener
thread exists that could access those fields during the reset. Hence we
can restrict the reset to the non-exiting code path, which resolves the
current situation; see the sketch below.
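
A sketch of the shape of the fix (again assuming the
`(sock_info, exiting)` signature; the comment wording is ours):

   static
   void cleanup_sock_info(struct sock_info *sock_info, int exiting)
   {
           /* ... socket and wait-shm teardown as before ... */

           /*
            * Only reset fields used by the listener thread when no
            * such thread can be running, i.e. on the after-fork-child
            * path, never on the exit path.
            */
           if (!exiting) {
                   sock_info->registration_done = 0;
                   sock_info->initial_statedump_done = 0;
           }
   }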

Known drawbacks
===============

None.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I45ba3eaee20c49a3988837a87fa680ce0a6ed953