Benjamin Poirier
benjamin.poirier@polymtl.ca
2009

+ About time synchronization
This framework performs offline time synchronization. This means that the
synchronization is done after tracing is over. It is not the same as online
synchronization like what is done by NTP, nor is it directly influenced by it.

Event timestamps are adjusted according to a clock correction function that
compensates for initial offset and rate offset (i.e. clocks that don't start
out at the same value and clocks that don't run at the same speed). It can
work on two or more traces.

The synchronization is based on relations identified in network traffic
between nodes. So, for it to work, there must be traffic exchanged between the
nodes. At the moment, this must be TCP traffic. Any kind will do (ssh, http,
...)

For scientific information about the algorithms used, see:
* Duda, A., Harrus, G., Haddad, Y., and Bernard, G.: Estimating global time in
  distributed systems, Proc. 7th Int. Conf. on Distributed Computing Systems,
  Berlin, volume 18, 1987
* Ashton, P.: Algorithms for Off-line Clock Synchronisation, University of
  Canterbury, December 1995
  http://www.cosc.canterbury.ac.nz/research/reports/TechReps/1995/tr_9512.pdf

+ Using time synchronization
++ Recording traces
To use time synchronization you have to record traces on multiple nodes
simultaneously with lttng (the tracer). While recording the traces, you have
to make sure the following markers are enabled:
* dev_receive
* dev_xmit_extended
* tcpv4_rcv_extended
* udpv4_rcv_extended
You can use the 'ltt-armall' and 'ltt-armnetsync' scripts for this.

You also have to make sure there is some TCP traffic between the traced nodes.

++ Viewing traces
Afterwards, you have to make sure all the traces are accessible from a single
machine, where lttv (the viewer) is run.

Time synchronization is enabled and controlled via the following lttv options,
as seen with "-h":
--sync
        synchronize the time between the traces
--sync-stats
        print statistics about the time synchronization
--sync-null
        read the events but do not perform any processing, this
        is mostly for performance evaluation
--sync-analysis - argument: chull, linreg
        specify the algorithm to use for event analysis
--sync-graphs
        output gnuplot graph showing synchronization points
--sync-graphs-dir - argument: DIRECTORY
        specify the directory where to store the graphs, by
        default in "graphs-<lttv-pid>"

To enable synchronization, start lttv with the "--sync" option. It can be
used in text mode or in GUI mode. You can add the traces one by one in the GUI
but this will recompute the synchronization after every trace that is added.
Instead, you can save some time by specifying all your traces on the command
line (using -t).

Example:
lttv-gui -t traces/node1 -t traces/node2 --sync

++ Statistics
The --sync-stats option is useful to make sure the synchronization algorithms
worked. Here is an example output (with added comments) from a successful
chull (one of the synchronization algorithms) run of two traces:
LTTV processing stats:
        received frames: 452
        received frames that are IP: 452
        received and processed packets that are TCP: 268
        sent packets that are TCP: 275
TCP matching stats:
        total input and output events matched together to form a packet: 240
Message traffic:
        0 - 1 : sent 60 received 60
# Note that 60 + 60 < 240, this is because there was loopback traffic, which
# is discarded.
Convex hull analysis stats:
        out of order packets dropped from analysis: 0
Number of points in convex hulls:
        0 - 1 : lower half-hull 7 upper half-hull 9
Individual synchronization factors:
        0 - 1 : Middle a0= -1.33641e+08 a1= 1 - 4.5276e-08 accuracy 1.35355e-05
                a0: -1.34095e+08 to -1.33187e+08 (delta= 907388)
                a1: 1 -6.81298e-06 to +6.72248e-06 (delta= 1.35355e-05)
Resulting synchronization factors:
        trace 0 drift= 1 offset= 0 (0.000000) start time= 18.799023588
        trace 1 drift= 1 offset= 1.33641e+08 (0.066818) start time= 19.090688494
Synchronization time:
        real time: 0.113308
        user time: 0.112007
        system time: 0.000000

++ Algorithms
The synchronization framework is extensible and already includes two
algorithms: chull and linreg. You can choose which analysis algorithm to use
with the --sync-analysis option.

+ Design
This part describes the design of the synchronization framework. This is to
help programmers interested in:
* adding new synchronization algorithms (analysis part)
  There are already two analysis algorithms available: chull and linreg
* using new types of events (processing and matching parts)
* using time synchronization with another data source/tracer (processing part)
  There are already two data sources available: lttng and unittest

++ Sync chain
This part is specific to the framework in use: the program doing the
synchronization, the executable linking to the event_*.o files,
eg. LTTV, unittest

It reads parameters, creates SyncState and calls the processing init
function. The "sync chain" is the set of event-* modules. At the moment there
is only one module at each stage. However, as more modules are added, it will
become relevant to have many modules at the same stage simultaneously. This
will require some modifications. I've kept this possibility at the back of my
mind while designing.

++ Stage 1: Event processing
Specific to the tracing data source,
eg. LTTng, LTT userspace, libpcap

Read the events from the trace and stuff them in an appropriate Event object.

++ Communication between stages 1 and 2: events
Communication is done via objects specialized from Event. At the moment, all
*Event are in data_structures.h. Specific event structures and functions could
be in separate files. This way, adding a new set of modules would require
shipping extra data_structures* files instead of modifying the existing one.
For this to work, Event.type could no longer be an enum; it could be an int
that uses #defines or constants defined in the specialized data_structures*
files. Event.event could be a void*.

++ Stage 2: Event matching
This stage and its modules are specific to the type of event. Event processing
feeds the events one at a time but event analysis works on groups of events.
Event matching is responsible for forming these groups. Generally speaking,
these can have different types of relation ("one to one", "one to many", or a
mix) and this will influence the overall behavior of the module,
eg. TCP, UDP, MPI

matchEvent() takes an Event pointer. An actual matching module doesn't have to
be able to process every type of event. It will only be passed events of a
type it can process (according to the .canMatch field of its MatchingModule
struct).

++ Communication between stages 2 and 3: event groups
Communication consists of events grouped in Message, Exchange or Broadcast
structs.

About exchanges:
If one event pair is a packet (more generally, something representable as a
Message), an exchange is composed of at least two packets, one in each
direction. There should be a non-negative minimum "round trip time" (RTT)
between the first and last event of the exchange. This RTT should be as small
as possible, so these packets should be closely related in time, like a data
packet and an acknowledgement packet. If the events analyzed are such that the
minimum RTT can be zero, there's nothing gained in analyzing exchanges beyond
what can already be figured out by analyzing packets.

An exchange can also consist of more than two packets, in case one packet
single-handedly acknowledges many data packets. In this case, it is best to
use the last acknowledged packet. Assuming a linear clock, an acknowledged
packet is as good as any other. However, since the linear clock assumption
strays further from reality as the interval grows longer, it is best to keep
the interval between the two packets as short as possible.

++ Stage 3: Event analysis
This stage and its modules are specific to the algorithm that analyzes events
to deduce synchronization factors,
eg. convex hull, linear regression, broadcast Maximum Likelihood Estimator

Instead of having one analyzeEvents() function that can receive any sort of
grouping of events, there are three prototypes: analyzeMessage(),
analyzeExchange() and analyzeBroadcast(). A module implements only the
relevant one(s) and sets the other function pointers to NULL in its
AnalysisModule struct.

The approach is different from matchEvent(), where there is one point of entry
no matter the type of event. The analyze*() approach has the advantage that
there is no casting or type detection to do. It is also possible to deduce
from the function pointers which groupings of events a module can analyze.
However, it means each analysis module will have to be modified if there is
ever a new type of event grouping.

I chose this approach because:
1) I thought it likely that there will be new types of events but not so
   likely that there will be new types of event groups.
2) All events share some members (time, traceNb, ...) but event groups do
   not.
3) We'll see which one of the two approaches works best and we can adapt
   later.

++ Data flow
Data from traces flows "down" from processing to matching to analysis. Factors
come back up.

++ Evolution and adaptation
It is possible to change/add another sync chain and to add other event_*
modules. It has been done. New types of events may need to be added to
data_structures.h. This is only to link between Event-* modules. If the data
does not have to be shared, data_structures.h does not have to be modified.

At the moment there is some code duplication in the last steps of linreg and
chull analysis: the code to propagate the factors when there are more than two
nodes. Maybe there could be a Stage 4 that does that?