From: Jonathan Rajotte
Date: Wed, 23 Jun 2021 02:17:03 +0000 (-0400)
Subject: Fix: ust: UST communication can return -EAGAIN
X-Git-Tag: v2.11.8~3
X-Git-Url: https://git.liburcu.org/?a=commitdiff_plain;h=ee3f5169cd70c5b930a195427a77a39380af5af8;hp=ee3f5169cd70c5b930a195427a77a39380af5af8;p=lttng-tools.git

Fix: ust: UST communication can return -EAGAIN

Observed issue
==============

The following scenario leads to an abort on event creation. The problem
manifests itself when an application is unresponsive. Note that the
default timeout for ust communication is 5 seconds.

    # Start an instrumented app
    ./app
    gdb lttng-sessiond
    # put a breakpoint on ustctl_create_event.
    lttng create my_session
    lttng enable-event -u -a
    lttng start
    # The breakpoint should hit. Do not continue.
    kill -s SIGSTOP $(pgrep app)
    # Continue lttng-sessiond.
    # lttng-sessiond will abort.

Note that this is not expected behaviour for UST. An expected
communication failure with a single app should not invalidate the
complete channel, compromise its setup, or result in an abort.

A similar scenario at the following ustctl call sites also leads to a
situation where the failure of a single app results in error reporting
and/or error propagation to upper-level objects.

Problematic call sites:
    ustctl_set_exclusion
    ustctl_set_filter
    ustctl_disable_channel

These call sites are also fixed by this patch.

Cause
=====

For an unresponsive application, EAGAIN is returned and is treated as an
"unknown" hard error.

In this particular case the abort() call was introduced by commit
88e3c2f5610b9ac89b0923d448fee34140fc46fb [1]. It is not clear whether
this is a leftover from a debugging session, since this is the only
call site where an abort is issued on communication failure via ustctl.

Solution
========

Handle EAGAIN coming from ustctl_* and treat it the same way a dying
application is handled. The only minor difference is that we WARN on
communication timeout.
Although this is not the most useful thing for a CLI client, it could
help users of lttng-sessiond overall in timeout situations.

Most call sites already handled "unknown" errors correctly. For those
call sites, we simply end up providing more information about the
timeout issue instead of mentioning that "-11" was returned.

Note that the reclamation of the "app" is handled by the poll loop and
ust_app_unregister, since the socket is shut down by lttng-ust
internally on error, including EAGAIN.

Note that the application will try to register itself back with
lttng-sessiond based on its configuration.

Known drawbacks
===============

None

References
==========

[1] https://github.com/lttng/lttng-tools/commit/88e3c2f5610b9ac89b0923d448fee34140fc46fb

Fixes: #1384
Signed-off-by: Jonathan Rajotte
Change-Id: If364b5d48e7fd2b664276a0fb1b7eec2c45ed683
Signed-off-by: Jonathan Rajotte
Signed-off-by: Jérémie Galarneau
---
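
Note (illustration only, not part of the patch): the error handling
described under "Solution" can be sketched roughly as below. This is a
minimal, standalone sketch and not the actual lttng-sessiond code. The
names comm_outcome, classify_comm_result and handle_comm_result are
hypothetical, and the use of -EPIPE as the "dying application" return
code is an assumption made for the example. The message above only
states that -EAGAIN is returned for an unresponsive application and
that it should be handled like a dying application, with a warning.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical classification of a ustctl-style return code. */
enum comm_outcome {
	COMM_OK,          /* call succeeded */
	COMM_APP_GONE,    /* application is dying: skip it quietly */
	COMM_APP_TIMEOUT, /* application unresponsive: warn, then skip it */
	COMM_HARD_ERROR,  /* anything else: report and propagate */
};

/* Map a negative errno-style return code to an outcome. */
static enum comm_outcome classify_comm_result(int ret)
{
	if (ret >= 0) {
		return COMM_OK;
	}

	switch (ret) {
	case -EPIPE:
		/* Assumption: -EPIPE stands in for "application is dying". */
		return COMM_APP_GONE;
	case -EAGAIN:
		/* Communication timed out: the application is unresponsive. */
		return COMM_APP_TIMEOUT;
	default:
		return COMM_HARD_ERROR;
	}
}

/*
 * Return 0 when the failure only concerns a single application and
 * must not propagate to upper-level objects; return the original
 * error code otherwise. Never abort.
 */
static int handle_comm_result(int ret, int pid)
{
	switch (classify_comm_result(ret)) {
	case COMM_OK:
		return 0;
	case COMM_APP_GONE:
		fprintf(stderr, "DEBUG: application (pid %d) is dying\n", pid);
		return 0;
	case COMM_APP_TIMEOUT:
		fprintf(stderr, "WARN: communication with application (pid %d) timed out\n", pid);
		return 0;
	default:
		fprintf(stderr, "ERROR: communication with application (pid %d) failed: %d\n", pid, ret);
		return ret;
	}
}
```

With this shape, a caller iterating over registered applications could
keep going when handle_comm_result() returns 0 and only fail the
enclosing operation on a non-zero return, which matches the intent that
a single unresponsive app should not invalidate the complete channel.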