From: Jérémie Galarneau
Date: Sat, 17 Feb 2024 13:57:47 +0000 (-0500)
Subject: Fix: relayd: live: dispose of zombie viewer metadata stream
X-Git-Url: https://git.liburcu.org/?a=commitdiff_plain;h=55d3f5e528059267f44fb24b179b992bf2b0cfa6;hp=55d3f5e528059267f44fb24b179b992bf2b0cfa6;p=lttng-tools.git

Fix: relayd: live: dispose of zombie viewer metadata stream

Issue observed
==============

In the CI, builds on SLES15SP5 frequently experience timeouts. From
prior inspections, there are hangs during
tests/regression/tools/clear/test_ust while waiting for babeltrace to
exit.

It is possible to reproduce the problem fairly easily:

  $ lttng create --live
  $ lttng enable-event --userspace --all
  $ lttng start

  # Launch an application that emits a couple of events
  $ ./my_app
  $ lttng stop

  # Clear the data, this eventually results in the deletion of all
  # trace files on the relay daemon's end.
  $ lttng clear

  # Attach to the live session from another terminal
  $ babeltrace -i lttng-live net://...

  # The 'destroy' command completes, but the viewer never exits.
  $ lttng destroy

Cause
=====

After the clear command completes, the relay daemon no longer has any
data to serve. We notice that the live client loops endlessly,
repeatedly sending GET_METADATA requests. In response, the relay daemon
replies with the NO_NEW_METADATA status.

In concrete terms, the viewer_get_metadata() function short-circuits to
send that reply when it sees that the metadata stream has no active
trace chunk (i.e., there are no backing files from which to read the
data at the moment).

This situation is not abnormal in itself: it is legitimate for a client
to wait for the metadata to become available again.

For example, in the reproducer above, it would be possible for the user
to restart the tracing (lttng start), which would create a new trace
chunk and make the metadata stream available. New events could also be
emitted following this restart.

However, when a session's connection is closed, there is no hope that
the metadata stream will ever transition back to an active trace chunk.

Solution
========

When the metadata stream has no active chunk and the corresponding
consumerd-side connection has been closed, there is no way the relay
daemon will be able to serve the metadata contents to the client.

As such, the viewer stream can be disposed of since it will no longer
be of any use to the client. Since some client implementations expect
at least one GET_METADATA command to result in NO_NEW_METADATA, that
status code is initially returned.

Later, when the client emits a follow-up GET_METADATA request for that
same stream, it will receive an "error" status indicating that the
stream no longer exists. This situation is not treated as an error by
the clients. For instance, babeltrace2 will simply close the
corresponding trace and indicate it ended.

The 'no_new_metadata_notified' flag doesn't appear to be necessary to
implement the behaviour expected by the clients (seeing at least one
NO_NEW_METADATA status reply for every metadata stream).

The viewer_get_metadata() function is refactored a bit to drop the
global reference to the viewer metadata stream as it exits, while still
returning the NO_NEW_METADATA status code.

Known drawbacks
===============

None.

Note
====

The commit message of e8b269fa provides more details behind the
intention of the 'no_new_metadata_notified' flag.

Change-Id: Ib1b80148d7f214f7aed221d3559e479b69aedd82
Signed-off-by: Jérémie Galarneau
---
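
Illustrative note: the reply sequence described in the Solution section
can be modelled roughly as follows. This is a minimal sketch and not the
relay daemon's actual code: the enum, the struct, its fields, and
model_get_metadata() are hypothetical names introduced here, and the
real viewer_get_metadata() deals with reference counting, locking, and
the wire protocol, all of which are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reply codes, loosely named after the statuses above. */
enum model_reply {
	MODEL_REPLY_OK,
	MODEL_REPLY_NO_NEW_METADATA,
	MODEL_REPLY_ERR,
};

/* Hypothetical, simplified model of a viewer metadata stream. */
struct model_metadata_stream {
	/* Are there backing files from which metadata can be read? */
	bool has_active_trace_chunk;
	/* Has the consumerd-side connection been closed? */
	bool connection_closed;
	/* Has the viewer stream already been disposed of? */
	bool disposed;
};

/* Models how one GET_METADATA request is answered. */
enum model_reply model_get_metadata(struct model_metadata_stream *stream)
{
	assert(stream != NULL);

	if (stream->disposed) {
		/* Follow-up request: the stream no longer exists. */
		return MODEL_REPLY_ERR;
	}

	if (!stream->has_active_trace_chunk) {
		if (stream->connection_closed) {
			/*
			 * No chunk and no connection: the metadata can
			 * never become available again, so dispose of
			 * the zombie stream as the function exits.
			 */
			stream->disposed = true;
		}

		/* This request is still answered with NO_NEW_METADATA. */
		return MODEL_REPLY_NO_NEW_METADATA;
	}

	/* Metadata can be served from the active trace chunk. */
	return MODEL_REPLY_OK;
}
```

In this model, a stream with no active chunk and a closed connection
answers NO_NEW_METADATA once and an error on the next request, while a
stream whose connection is still open keeps answering NO_NEW_METADATA
so that the client can wait for tracing to restart.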