marker tests
[lttv.git] / tests / markers / markers-microbench-0.2.txt
1
2
3 * Microbenchmarks
4
5 Use timestamp counter to calculate the time spent, with interrupts disabled.
6 Machine : Pentium 4 3GHz
7 Fully preemptible kernel
8 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
9 Linux Kernel Markers 0.19
10
11 * Execute an empty loop
12 NR_LOOPS : 10000000
13 time delta (cycles): 15026497
14 cycles per loop : 1.50
15 - i386 "optimized" : immediate value, test and predicted branch
16 (non connected marker)
17 NR_LOOPS : 10000000
18 time delta (cycles): 40031640
19 cycles per loop : 4.00
20 cycles per loop for marker : 2.50
21 - i386 "generic" : load, test and predicted branch
22 (non connected marker)
23 NR_LOOPS : 10000000
24 time delta (cycles): 26697878
25 cycles per loop : 2.67
26 cycles per loop for marker : 1.17
27
28 * Execute a loop of memcpy 4096 bytes
29 - Without marker
30 NR_LOOPS : 10000
31 time delta (cycles): 12981555
32 cycles per loop : 1298.16
33 - i386 "optimized" : immediate value, test and predicted branch
34 (non connected marker)
35 NR_LOOPS : 10000
36 time delta (cycles): 12982290
37 cycles per loop : 1298.23
38 cycles per loop for marker : 0.074
39 - i386 "generic" : load, test and predicted branch
40 (non connected marker)
41 NR_LOOPS : 10000
42 time delta (cycles): 13002788
43 cycles per loop : 1300.28
44 cycles per loop for marker : 2.123
45
46
47 The following tests are done with the "optimized" markers only
48
49 Execute a loop with marker enabled, with i386 "fastcall" register argument
50 setup, probe empty.
51 NR_LOOPS : 100000
52 time delta (cycles): 4407608
53 cycles per loop : 44.08
54 cycles per loop to setup arguments in registers : 44.08-4.00=40.08
55
56 Execute a loop with a marker enabled, with an empty probe. Var args argument
57 setup, probe empty.
58 NR_LOOPS : 100000
59 time delta (cycles): 5210587
60 cycles per loop : 52.11
61 additional cycles per loop to setup var args : 52.11-44.08=8.03
62
63 Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
64 Data is copied by the probe.
65 NR_LOOPS : 100000
66 time delta (cycles): 5299837
67 cycles per loop : 53.00
68 additional cycles per loop to get arguments in probe (from stack) on x86 :
69 53.00-52.11=0.89
70
71 Execute a loop with marker enabled, with var args probe expecting arguments.
72 Data is copied by the probe.
73 NR_LOOPS : 100000
74 time delta (cycles): 5574300
75 cycles per loop : 55.74
76 additional cycles per loop to get expected variable arguments on x86 :
77 55.74-53.00=2.74
78
79 Execute a loop with marker enabled, with var args probe, format string
80 Data is copied by the probe.
81 processing.
82 NR_LOOPS : 100000
83 time delta (cycles): 9622117
84 cycles per loop : 96.22
85 additional cycles per loop to dynamically parse arguments with format string :
86 96.22-55.74=40.48
87
88
89 * Assembly code
90
91
92 - Optimized
93
94 static int my_open(struct inode *inode, struct file *file)
95 {
96 0: 55 push %ebp
97 1: 89 e5 mov %esp,%ebp
98 3: 83 ec 0c sub $0xc,%esp
99 MARK(subsys_mark1, "%d %p", 1, NULL);
100 6: b0 00 mov $0x0,%al
101 8: 84 c0 test %al,%al
102 a: 75 07 jne 13 <my_open+0x13>
103
104 return -EPERM;
105 }
106 c: b8 ff ff ff ff mov $0xffffffff,%eax
107 11: c9 leave
108 12: c3 ret
109 13: b8 01 00 00 00 mov $0x1,%eax
110 18: e8 fc ff ff ff call 19 <my_open+0x19>
111 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
112 24: 00
113 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
114 2c: 00
115 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
116 34: ff 15 74 10 00 00 call *0x1074
117 3a: b8 01 00 00 00 mov $0x1,%eax
118 3f: e8 fc ff ff ff call 40 <my_open+0x40>
119 44: eb c6 jmp c <my_open+0xc>
120
121
122 - Generic
123
124 static int my_open(struct inode *inode, struct file *file)
125 {
126 0: 55 push %ebp
127 1: 89 e5 mov %esp,%ebp
128 3: 83 ec 0c sub $0xc,%esp
129 MARK(subsys_mark1, "%d %p", 1, NULL);
130 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax
131 d: 84 c0 test %al,%al
132 f: 75 07 jne 18 <my_open+0x18>
133
134 return -EPERM;
135 }
136 11: b8 ff ff ff ff mov $0xffffffff,%eax
137 16: c9 leave
138 17: c3 ret
139 18: b8 01 00 00 00 mov $0x1,%eax
140 1d: e8 fc ff ff ff call 1e <my_open+0x1e>
141 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
142 29: 00
143 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
144 31: 00
145 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
146 39: ff 15 74 10 00 00 call *0x1074
147 3f: b8 01 00 00 00 mov $0x1,%eax
148 44: e8 fc ff ff ff call 45 <my_open+0x45>
149 49: eb c6 jmp 11 <my_open+0x11>
150
151 * Size (x86)
152
153 - Optimized
154
155 Adds 6 bytes in the "likely" path.
156 Adds 32 bytes in the "unlikely" path.
157
158 - Generic
159
160 Adds 11 bytes in the "likely" path.
161 Adds 32 bytes in the "unlikely" path.
162
163
164 * Macrobenchmarks
165
166 Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
167 Running a 2.6.17 vanilla kernel :
168 real 8m2.443s
169 user 7m35.124s
170 sys 0m34.950s
171
172 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
173 real 8m1.635s
174 user 7m34.552s
175 sys 0m36.298s
176
177 Ping flood on loopback interface :
178 Running a 2.6.17 vanilla kernel :
179 136596 packets transmitted, 136596 packets received, 0% packet loss
180 round-trip min/avg/max = 0.0/0.0/0.1 ms
181
182 real 0m10.840s
183 user 0m0.360s
184 sys 0m10.485s
185
186 12601 packets transmitted/s
187
188 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
189 108504 packets transmitted, 108504 packets received, 0% packet loss
190 round-trip min/avg/max = 0.0/0.0/0.1 ms
191
192 real 0m8.614s
193 user 0m0.264s
194 sys 0m8.353s
195
196 12596 packets transmitted/s
197
198
199
200 Conclusion
201
202 In an empty loop, the generic marker is faster than the optimized marker. This
203 may be due to better performances of the movzbl instruction over the movb on the
204 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
205 of the movzbl becomes greater because it uses the memory bandwidth.
206
207 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
208 as much as dynamically parsing the format string to get the variable arguments
209 (40.48 cycles).
210
211 There is almost no difference, on x86, between passing the arguments directly on
212 the stack and using a variable argument list when its layout is known
213 statically (0.89 cycles vs 2.74 cycles).
214
215
216
This page took 0.034313 seconds and 5 git commands to generate.