Some thoughts about userspace tracing

Mathieu Desnoyers January 2006



* Goals

Fast and secure user space tracing.

Fast :

- 5000ns for a system call is too long. Writing an event directly to memory
        takes 220ns.
- Still, we can afford a system call for buffer switch, which occurs less often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
        Mutexes are implemented with a short spin lock, followed by a yield,
        which is yet another system call. In addition, we have no way to know
        on which CPU we are running when in user mode. We can be preempted
        anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
        process's trace. It should be limited to its own memory space.



* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) will remove the requirement for
locking. This is the fast and safe way to deal with concurrency coming from
signal handlers.
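
As a rough illustration of atomic space reservation, here is a minimal C11
sketch. It is not LTTng code: the names trace_buf, trace_reserve and
trace_write, and the single 4096-byte buffer, are assumptions made for the
example. A writer claims its slot with a compare-and-swap loop, so a signal
handler that interrupts it simply claims the next slot.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TRACE_BUF_SIZE 4096     /* assumption: one 4k page per buffer */

/* Hypothetical buffer layout, for illustration only. */
struct trace_buf {
    _Atomic size_t offset;      /* next free byte in data[] */
    unsigned char data[TRACE_BUF_SIZE];
};

/* Reserve len bytes. Returns the start offset, or (size_t)-1 if the
 * buffer cannot hold the event. */
static size_t trace_reserve(struct trace_buf *buf, size_t len)
{
    size_t old = atomic_load_explicit(&buf->offset, memory_order_relaxed);
    size_t new_off;

    do {
        if (len > TRACE_BUF_SIZE - old)
            return (size_t)-1;
        new_off = old + len;
        /* On failure, old is reloaded with the current offset and the
         * loop retries; a nested signal handler may have moved it. */
    } while (!atomic_compare_exchange_weak_explicit(&buf->offset, &old,
                                                    new_off,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return old;
}

/* Copy one event into its reserved slot. Returns 0, or -1 if full. */
static int trace_write(struct trace_buf *buf, const void *event, size_t len)
{
    size_t start = trace_reserve(buf, len);

    if (start == (size_t)-1)
        return -1;
    memcpy(buf->data + start, event, len);
    return 0;
}
```

When trace_reserve fails, a real writer would presumably go through the buffer
switch path rather than drop the event. The sketch also assumes that atomic
operations on size_t are lock-free on the target, which is what makes them
usable from a signal handler.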

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from kernel to user space. That could
be seen as a hack, as I have never seen such an interface anywhere else. It may
lead to problems related to exported types. The proper, but slow, way to do it
would be to have a system call that would return the tracing status.

My suggestion is to go for a system call, but only call it :

- when the process starts
- when receiving a SIG_UPDTRACING

Two possibilities :

- one system call per information to get/one system call to get all information.
- one signal per information to get/one signal for "update" tracing info.

I would tend to adopt :

- One signal for "general tracing update"
        One signal handler would clearly be enough, more would be unnecessary
        overhead/pollution.
- One system call for all updates.
        We will need to have multiple parameters though. We have up to 6 parameters.

syscall get_tracing_info

first parameter : active traces mask (32 bits : 32 traces).
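
The "one signal, one system call" scheme could look roughly like the sketch
below. Since get_tracing_info does not exist, it is replaced here by a stub
that reads a fake kernel-side variable; the stub, the cached mask and every
function name are assumptions for illustration. The point is only that the
fast path reads a cached mask and never enters the kernel.

```c
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_TRACES 32   /* from the text: 32-bit mask, 32 traces */

/* Stand-in for the kernel side. In the proposal this would be the
 * get_tracing_info system call; this stub is purely hypothetical. */
static uint32_t fake_kernel_mask;

static int get_tracing_info(uint32_t *active_mask)
{
    *active_mask = fake_kernel_mask;
    return 0;
}

/* Cached copy of the active traces mask, read without any system call. */
static _Atomic uint32_t cached_active_mask;

/* Called at process start and from the update signal handler. */
static void update_tracing_info(void)
{
    uint32_t mask;

    if (get_tracing_info(&mask) == 0)
        atomic_store(&cached_active_mask, mask);
}

static void on_update_signal(int signo)
{
    (void)signo;
    update_tracing_info();
}

/* signo plays the role of SIG_UPDTRACING, which is not a real signal.
 * ISO C signal() keeps the sketch short; sigaction() would be the
 * usual choice in real code. */
static int install_update_handler(int signo)
{
    return signal(signo, on_update_signal) == SIG_ERR ? -1 : 0;
}

/* Fast-path check: is trace number `trace` currently active? */
static int trace_is_active(unsigned int trace)
{
    if (trace >= MAX_TRACES)
        return 0;
    return (atomic_load(&cached_active_mask) >> trace) & 1u;
}
```

A process would call install_update_handler() and update_tracing_info() once
at startup, then test trace_is_active() before writing each event.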


Concurrency

We must have per thread buffers. Then, no memory can be written by two threads
at once. It removes the need for locks (ok, atomic reservation was already doing
that) and removes false sharing.


Multiple traces

By having the number of active traces, we can allocate as many buffers as we
need. The only thing is that the buffers will only be allocated when receiving
the signal/starting the process and getting the number of active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

When adding a new buffer, we should call the set_tracing_info syscall and give
the new buffers array to the kernel. It's an array of 32 pointers to user pages.
They will be used by the kernel to get the last pages when the thread dies.

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
        - the sig handler gets the new tracing status, which says that tracing
                is disabled for the specific trace. It writes this status in the
                tracing control structure of the process.
        - If tracing_level is 0, well, it's fine : there are no potential writers in
                the removed trace. It's up to us to buffer switch the removed trace, and,
                after the control returns to us, set_tracing_info this page to NULL and
                delete this memory area.
        - Else (tracing_level > 0), flag the removed trace for later switch/delete.

        It then returns control to the process.

- If the tracing_level was > 0, there was one or more writers potentially
        accessing this memory area. When the control comes back to the writer, at the
        end of the write in a trace, if the trace is marked for switch/delete and the
        tracing_level is 0 (after the decrement of the writer itself), then the
        writer must buffer switch, set_tracing_info to NULL and then delete the
        memory area.
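
The two-step removal can be sketched as follows, for one trace in one thread.
Everything here is hypothetical: trace_ctl, the enabled and pending_delete
flags, and trace_teardown, which only stands in for "buffer switch,
set_tracing_info to NULL, delete the memory area". Only tracing_level comes
from the text above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-thread control state for one trace (hypothetical layout). */
struct trace_ctl {
    atomic_int  tracing_level;   /* nesting depth inside tracing code */
    atomic_bool enabled;         /* cleared when the trace is removed */
    atomic_bool pending_delete;  /* removal deferred to the last writer */
    void       *page;            /* current buffer page, NULL once deleted */
    int         teardowns;       /* counts stand-in teardowns, for testing */
};

static void trace_ctl_init(struct trace_ctl *t, void *page)
{
    atomic_init(&t->tracing_level, 0);
    atomic_init(&t->enabled, true);
    atomic_init(&t->pending_delete, false);
    t->page = page;
    t->teardowns = 0;
}

/* Stand-in for: buffer switch, set_tracing_info(NULL), delete the area.
 * Idempotent, so a repeated removal is harmless. */
static void trace_teardown(struct trace_ctl *t)
{
    if (t->page == NULL)
        return;
    t->page = NULL;
    t->teardowns++;
}

/* Step 1: run from the update signal handler when the trace is removed. */
static void trace_remove(struct trace_ctl *t)
{
    atomic_store(&t->enabled, false);
    if (atomic_load(&t->tracing_level) == 0)
        trace_teardown(t);                       /* no writer in flight */
    else
        atomic_store(&t->pending_delete, true);  /* defer to the writer */
}

/* Step 2: the outermost writer finishes a deferred removal on exit. */
static void trace_exit(struct trace_ctl *t)
{
    /* fetch_sub returns the previous value: 1 means the level is now 0. */
    if (atomic_fetch_sub(&t->tracing_level, 1) == 1 &&
        atomic_exchange(&t->pending_delete, false))
        trace_teardown(t);
}

/* Returns true if the caller may write; pair each true with trace_exit. */
static bool trace_enter(struct trace_ctl *t)
{
    /* Raise the level before checking enabled, so a signal arriving in
     * between sees a writer and defers the teardown. */
    atomic_fetch_add(&t->tracing_level, 1);
    if (atomic_load(&t->enabled))
        return true;
    trace_exit(t);
    return false;
}
```

The order in trace_enter matters: if the level were raised after the enabled
check, a removal signal arriving between the two could delete the page under a
writer that had already decided to write.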


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

parameter 2 for the get tracing info : array of 32 ints (32 bits).
Each integer is the filter mask for a trace. As there are up to 32 active
traces, we have 32 integers for filter.
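
A small sketch of how the two get_tracing_info parameters might be consulted
together. The text does not say what a bit in a filter mask means, so this
assumes that bit n set means "events of class n are recorded"; that meaning,
the tracing_info struct and both function names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TRACES 32

/* Cached result of get_tracing_info (hypothetical layout). */
struct tracing_info {
    uint32_t active_mask;           /* parameter 1: one bit per trace */
    uint32_t filter[MAX_TRACES];    /* parameter 2: one mask per trace */
};

/* Assumption: bit event_class of filter[trace] enables that class. */
static bool event_wanted(const struct tracing_info *info,
                         unsigned int trace, unsigned int event_class)
{
    if (trace >= MAX_TRACES || event_class >= 32)
        return false;
    if (!((info->active_mask >> trace) & 1u))
        return false;               /* inactive trace: filter is ignored */
    return (info->filter[trace] >> event_class) & 1u;
}

/* Bitmask of the traces that should receive an event of this class. */
static uint32_t traces_for_event(const struct tracing_info *info,
                                 unsigned int event_class)
{
    uint32_t targets = 0;

    for (unsigned int trace = 0; trace < MAX_TRACES; trace++)
        if (event_wanted(info, trace, event_class))
            targets |= UINT32_C(1) << trace;
    return targets;
}
```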


Buffer switch

There could be a tracing_buffer_switch system call, that would give the page
start address as parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of the
page after the syscall).

Process dying

The kernel should be aware of the current pages used for tracing in each thread.
If a thread dies unexpectedly, we want the kernel to get the last bits of
information written before the thread crashed.

syscall set_tracing_info

parameter 1 : array of 32 user space pointers to current pages or NULL.


Memory protection

We want each process to be unable to make a trace unreadable, and each process
to have its own memory space.

Two possibilities :

Either we create one channel per process, or we have per cpu tracefiles for all
the processes, with the specification that data is written in a monotonically
increasing time order and that no process shares a 4k page with another process.

The problem with having only one tracefile per cpu is that we cannot safely
steal a process's buffer upon a schedule change because it may be currently
writing to it.

That leaves one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility to have mixed
32-64 bits processes on the same machine. Dealing with types will be easier.


Corrupted trace

A corrupted tracefile will only affect one thread. The rest of the trace will
still be readable.


Facilities

Upon process creation or when receiving the signal of trace info update, when a
new trace appears, the thread should write the facility information into it. It
must then have a list of registered facilities, all done at the thread level.

We must decide if we allow a facility channel for each thread. The advantage is
that we have a readable channel in flight recorder mode, while the disadvantage
is that it doubles the number of channels, which may become quite high. To
follow the general design of a high throughput channel and a low throughput
channel for vital information, I suggest having a separate channel for
facilities, per trace, per process.