update bench
[lttv.git] / tests / markers / markers-microbench-0.2.txt
CommitLineData
bc82195a 1
2
3* Microbenchmarks
4
5Use timestamp counter to calculate the time spent, with interrupts disabled.
b9c0c958 6Machine : Pentium 4 3GHz, 1GB ram
bc82195a 7Fully preemptible kernel
bc82195a 8Linux Kernel Markers 0.19
b9c0c958 9Kernel : Linux 2.6.17
10
11marker : MARK(subsys_mark1, "%d %p", 1, NULL);
12
13This marker, with two elements (integer and pointer) have been chosen because it
14is representative of high volume events. For instance, a trap entry event logs a
15trap_id (long) and an address (pointer). The same applies to system calls, where
16a system call entry event logs both the ID of the system call and the address of
17the caller.
18
bc82195a 19
20* Execute an empty loop
b9c0c958 21
22- Without marker
bc82195a 23NR_LOOPS : 10000000
24time delta (cycles): 15026497
25cycles per loop : 1.50
b9c0c958 26
bc82195a 27- i386 "optimized" : immediate value, test and predicted branch
28 (non connected marker)
29NR_LOOPS : 10000000
30time delta (cycles): 40031640
31cycles per loop : 4.00
b9c0c958 32cycles per loop for marker : 4.00-1.50=2.50
33
bc82195a 34- i386 "generic" : load, test and predicted branch
35 (non connected marker)
36NR_LOOPS : 10000000
37time delta (cycles): 26697878
38cycles per loop : 2.67
b9c0c958 39cycles per loop for marker : 2.67-1.50=1.17
40
bc82195a 41
42* Execute a loop of memcpy 4096 bytes
b9c0c958 43
44This test has been done to show the impact of markers on a system where the
45memory is already used, which is more representative of a running kernel.
46
bc82195a 47- Without marker
48NR_LOOPS : 10000
49time delta (cycles): 12981555
50cycles per loop : 1298.16
b9c0c958 51
bc82195a 52- i386 "optimized" : immediate value, test and predicted branch
53 (non connected marker)
54NR_LOOPS : 10000
55time delta (cycles): 12982290
56cycles per loop : 1298.23
b9c0c958 57cycles per loop for marker : 1298.23-1298.16=0.074
58
bc82195a 59- i386 "generic" : load, test and predicted branch
60 (non connected marker)
61NR_LOOPS : 10000
62time delta (cycles): 13002788
63cycles per loop : 1300.28
b9c0c958 64cycles per loop for marker : 1300.28-1298.16=2.123
bc82195a 65
66
67The following tests are done with the "optimized" markers only
68
b9c0c958 69- Execute a loop with marker enabled, with i386 "fastcall" register argument
70 setup, probe empty. With preemption disabling.
bc82195a 71NR_LOOPS : 100000
72time delta (cycles): 4407608
73cycles per loop : 44.08
b9c0c958 74cycles per loop to disable preemption and setup arguments in registers :
7544.08-4.00=40.08
bc82195a 76
b9c0c958 77- Execute a loop with a marker enabled, with an empty probe. Var args argument
78 setup, probe empty. With preemption disabling.
bc82195a 79NR_LOOPS : 100000
80time delta (cycles): 5210587
81cycles per loop : 52.11
82additional cycles per loop to setup var args : 52.11-44.08=8.03
83
b9c0c958 84- Execute a loop with a marker enabled, with an empty probe. Var args argument
85 setup, probe empty. No preemption disabling.
86NR_LOOPS : 100000
87time delta (cycles): 3363450
88cycles per loop : 33.63
89cycles per loop to disable preemption : 44.08-33.63=10.45
90
91- Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
92 Data is copied by the probe. With preemption disabling.
bc82195a 93NR_LOOPS : 100000
94time delta (cycles): 5299837
95cycles per loop : 53.00
96additional cycles per loop to get arguments in probe (from stack) on x86 :
b9c0c958 9753.00-52.11=0.89
bc82195a 98
b9c0c958 99- Execute a loop with marker enabled, with var args probe expecting arguments.
100 Data is copied by the probe. With preemption disabling.
bc82195a 101NR_LOOPS : 100000
102time delta (cycles): 5574300
103cycles per loop : 55.74
104additional cycles per loop to get expected variable arguments on x86 :
105 55.74-53.00=2.74
106
b9c0c958 107- Execute a loop with marker enabled, with var args probe, format string
108 Data is copied by the probe. This is a 6 bytes string to decode.
bc82195a 109processing.
110NR_LOOPS : 100000
111time delta (cycles): 9622117
112cycles per loop : 96.22
b9c0c958 113additional cycles per loop to dynamically parse arguments with a 6 bytes format
114string :
11596.22-55.74=40.48
bc82195a 116
117
118* Assembly code
119
b9c0c958 120The disassembly of the following marked function will be shown :
121
122static int my_open(struct inode *inode, struct file *file)
123{
124 MARK(subsys_mark1, "%d %p", 1, NULL);
125
126 return -EPERM;
127}
128
bc82195a 129
130- Optimized
131
132static int my_open(struct inode *inode, struct file *file)
133{
134 0: 55 push %ebp
135 1: 89 e5 mov %esp,%ebp
136 3: 83 ec 0c sub $0xc,%esp
137 MARK(subsys_mark1, "%d %p", 1, NULL);
b9c0c958 138 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
bc82195a 139 8: 84 c0 test %al,%al
140 a: 75 07 jne 13 <my_open+0x13>
141
142 return -EPERM;
143}
144 c: b8 ff ff ff ff mov $0xffffffff,%eax
145 11: c9 leave
146 12: c3 ret
147 13: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 148 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable
bc82195a 149 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
150 24: 00
151 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
152 2c: 00
153 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
b9c0c958 154 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer
bc82195a 155 3a: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 156 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable
bc82195a 157 44: eb c6 jmp c <my_open+0xc>
158
159
160- Generic
161
162static int my_open(struct inode *inode, struct file *file)
163{
164 0: 55 push %ebp
165 1: 89 e5 mov %esp,%ebp
166 3: 83 ec 0c sub $0xc,%esp
167 MARK(subsys_mark1, "%d %p", 1, NULL);
b9c0c958 168 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte
bc82195a 169 d: 84 c0 test %al,%al
170 f: 75 07 jne 18 <my_open+0x18>
171
172 return -EPERM;
173}
174 11: b8 ff ff ff ff mov $0xffffffff,%eax
175 16: c9 leave
176 17: c3 ret
177 18: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 178 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable
bc82195a 179 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
180 29: 00
181 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
182 31: 00
183 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
b9c0c958 184 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer
bc82195a 185 3f: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 186 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable
bc82195a 187 49: eb c6 jmp 11 <my_open+0x11>
188
b9c0c958 189
190
191Here is the typical var arg probe that has been used in those tests. It saves
192the values expectes as parameters in global variables. The DO_MARK1_FORMAT
193define is used for probe registration to make sure that it will be connected
194with a marker that has a matching format string. Note that this checking is
195optional : the probe can register with a NULL format and afterward check itself
196the format string received in parameter dynamically.
197
198int value;
199void *ptr;
200
201#define DO_MARK1_FORMAT "%d %p"
202void do_mark1(const char *format, ...)
203{
204 va_list ap;
205
206 va_start(ap, format);
207 value = va_arg(ap, int);
208 ptr = va_arg(ap, void*);
209
210 va_end(ap);
211}
212
213
214Here is the disassembly of the probe :
215
216#define DO_MARK1_FORMAT "%d %p"
217void do_mark1(const char *format, ...)
218{
219 0: 55 push %ebp
220 1: 89 e5 mov %esp,%ebp
221 3: 83 ec 04 sub $0x4,%esp
222 va_list ap;
223
224 va_start(ap, format);
225 value = va_arg(ap, int);
226 6: 8b 45 0c mov 0xc(%ebp),%eax
227 9: a3 00 00 00 00 mov %eax,0x0
228 ptr = va_arg(ap, void*);
229 e: 8b 45 10 mov 0x10(%ebp),%eax
230 11: a3 00 00 00 00 mov %eax,0x0
231
232 va_end(ap);
233}
234 16: c9 leave
235 17: c3 ret
236
237
bc82195a 238* Size (x86)
239
b9c0c958 240This is the size added by each marker to the memory image :
241
bc82195a 242- Optimized
243
b9c0c958 244.text section : instructions
bc82195a 245Adds 6 bytes in the "likely" path.
246Adds 32 bytes in the "unlikely" path.
b9c0c958 247.data section : r/w data
2480 byte
249.rodata.str1 : strings
250Length of the marker name
251.debug_str : strings (if loaded..)
252Length of the marker name + 7 bytes (__mark_)
253.markers
2548 bytes (2 pointers)
255.markers.c
25612 bytes (3 pointers)
257
258- Generic
259
260.text section : instructions
bc82195a 261Adds 11 bytes in the "likely" path.
262Adds 32 bytes in the "unlikely" path.
b9c0c958 263.data section : r/w data
2641 byte (the activation flag)
265.rodata.str1 : strings
266Length of the marker name
267.debug_str : strings (if loaded..)
268Length of the marker name + 7 bytes (__mark_)
269.markers
2708 bytes (2 pointers)
271.markers.c
27212 bytes (3 pointers)
bc82195a 273
274
275* Macrobenchmarks
276
277Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
278Running a 2.6.17 vanilla kernel :
279real 8m2.443s
280user 7m35.124s
281sys 0m34.950s
282
283Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
284real 8m1.635s
285user 7m34.552s
286sys 0m36.298s
287
288Ping flood on loopback interface :
289Running a 2.6.17 vanilla kernel :
290136596 packets transmitted, 136596 packets received, 0% packet loss
291round-trip min/avg/max = 0.0/0.0/0.1 ms
292
293real 0m10.840s
294user 0m0.360s
295sys 0m10.485s
296
29712601 packets transmitted/s
298
299Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
300108504 packets transmitted, 108504 packets received, 0% packet loss
301round-trip min/avg/max = 0.0/0.0/0.1 ms
302
303real 0m8.614s
304user 0m0.264s
305sys 0m8.353s
306
30712596 packets transmitted/s
308
309
310
311Conclusion
312
313In an empty loop, the generic marker is faster than the optimized marker. This
314may be due to better performances of the movzbl instruction over the movb on the
315Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
316of the movzbl becomes greater because it uses the memory bandwidth.
317
318The preemption disabling and call to a probe itself costs 48.11 cycles, almost
319as much as dynamically parsing the format string to get the variable arguments
320(40.48 cycles).
321
322There is almost no difference, on x86, between passing the arguments directly on
323the stack and using a variable argument list when its layout is known
324statically (0.89 cycles vs 2.74 cycles).
325
b9c0c958 326The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
327(typcal use, high memory usage) used by the optimized marker by adding the
328ability to insert a breakpoint at any location without any impact on the code
329when inactive. This breakpoint based approach is very useful to instrument core
330kernel code that has not been previously marked without need to recompile and
331reboot. We can therefore compare the case "without markers" to the null impact
332of the int3 breakpoint based approach when inactive.
333
334
335
336
337
338
339
340
341
bc82195a 342
343
This page took 0.035587 seconds and 4 git commands to generate.