marker tests
[lttv.git] / tests / markers / markers-microbench-0.2.txt
CommitLineData
bc82195a 1
2
3* Microbenchmarks
4
5Use timestamp counter to calculate the time spent, with interrupts disabled.
6Machine : Pentium 4 3GHz
7Fully preemptible kernel
8marker : MARK(subsys_mark1, "%d %p", 1, NULL);
9Linux Kernel Markers 0.19
10
11* Execute an empty loop
12NR_LOOPS : 10000000
13time delta (cycles): 15026497
14cycles per loop : 1.50
15- i386 "optimized" : immediate value, test and predicted branch
16 (non connected marker)
17NR_LOOPS : 10000000
18time delta (cycles): 40031640
19cycles per loop : 4.00
20cycles per loop for marker : 2.50
21- i386 "generic" : load, test and predicted branch
22 (non connected marker)
23NR_LOOPS : 10000000
24time delta (cycles): 26697878
25cycles per loop : 2.67
26cycles per loop for marker : 1.17
27
28* Execute a loop of memcpy 4096 bytes
29- Without marker
30NR_LOOPS : 10000
31time delta (cycles): 12981555
32cycles per loop : 1298.16
33- i386 "optimized" : immediate value, test and predicted branch
34 (non connected marker)
35NR_LOOPS : 10000
36time delta (cycles): 12982290
37cycles per loop : 1298.23
38cycles per loop for marker : 0.074
39- i386 "generic" : load, test and predicted branch
40 (non connected marker)
41NR_LOOPS : 10000
42time delta (cycles): 13002788
43cycles per loop : 1300.28
44cycles per loop for marker : 2.123
45
46
47The following tests are done with the "optimized" markers only
48
49Execute a loop with marker enabled, with i386 "fastcall" register argument
50 setup, probe empty.
51NR_LOOPS : 100000
52time delta (cycles): 4407608
53cycles per loop : 44.08
54cycles per loop to setup arguments in registers : 44.08-4.00=40.08
55
56Execute a loop with a marker enabled, with an empty probe. Var args argument
57 setup, probe empty.
58NR_LOOPS : 100000
59time delta (cycles): 5210587
60cycles per loop : 52.11
61additional cycles per loop to setup var args : 52.11-44.08=8.03
62
63Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
64 Data is copied by the probe.
65NR_LOOPS : 100000
66time delta (cycles): 5299837
67cycles per loop : 53.00
68additional cycles per loop to get arguments in probe (from stack) on x86 :
69 53.00-52.11=0.89
70
71Execute a loop with marker enabled, with var args probe expecting arguments.
72 Data is copied by the probe.
73NR_LOOPS : 100000
74time delta (cycles): 5574300
75cycles per loop : 55.74
76additional cycles per loop to get expected variable arguments on x86 :
77 55.74-53.00=2.74
78
79Execute a loop with marker enabled, with var args probe, format string
80 Data is copied by the probe.
81processing.
82NR_LOOPS : 100000
83time delta (cycles): 9622117
84cycles per loop : 96.22
85additional cycles per loop to dynamically parse arguments with format string :
86 96.22-55.74=40.48
87
88
89* Assembly code
90
91
92- Optimized
93
94static int my_open(struct inode *inode, struct file *file)
95{
96 0: 55 push %ebp
97 1: 89 e5 mov %esp,%ebp
98 3: 83 ec 0c sub $0xc,%esp
99 MARK(subsys_mark1, "%d %p", 1, NULL);
100 6: b0 00 mov $0x0,%al
101 8: 84 c0 test %al,%al
102 a: 75 07 jne 13 <my_open+0x13>
103
104 return -EPERM;
105}
106 c: b8 ff ff ff ff mov $0xffffffff,%eax
107 11: c9 leave
108 12: c3 ret
109 13: b8 01 00 00 00 mov $0x1,%eax
110 18: e8 fc ff ff ff call 19 <my_open+0x19>
111 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
112 24: 00
113 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
114 2c: 00
115 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
116 34: ff 15 74 10 00 00 call *0x1074
117 3a: b8 01 00 00 00 mov $0x1,%eax
118 3f: e8 fc ff ff ff call 40 <my_open+0x40>
119 44: eb c6 jmp c <my_open+0xc>
120
121
122- Generic
123
124static int my_open(struct inode *inode, struct file *file)
125{
126 0: 55 push %ebp
127 1: 89 e5 mov %esp,%ebp
128 3: 83 ec 0c sub $0xc,%esp
129 MARK(subsys_mark1, "%d %p", 1, NULL);
130 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax
131 d: 84 c0 test %al,%al
132 f: 75 07 jne 18 <my_open+0x18>
133
134 return -EPERM;
135}
136 11: b8 ff ff ff ff mov $0xffffffff,%eax
137 16: c9 leave
138 17: c3 ret
139 18: b8 01 00 00 00 mov $0x1,%eax
140 1d: e8 fc ff ff ff call 1e <my_open+0x1e>
141 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
142 29: 00
143 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
144 31: 00
145 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
146 39: ff 15 74 10 00 00 call *0x1074
147 3f: b8 01 00 00 00 mov $0x1,%eax
148 44: e8 fc ff ff ff call 45 <my_open+0x45>
149 49: eb c6 jmp 11 <my_open+0x11>
150
151* Size (x86)
152
153- Optimized
154
155Adds 6 bytes in the "likely" path.
156Adds 32 bytes in the "unlikely" path.
157
158- Generic
159
160Adds 11 bytes in the "likely" path.
161Adds 32 bytes in the "unlikely" path.
162
163
164* Macrobenchmarks
165
166Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
167Running a 2.6.17 vanilla kernel :
168real 8m2.443s
169user 7m35.124s
170sys 0m34.950s
171
172Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
173real 8m1.635s
174user 7m34.552s
175sys 0m36.298s
176
177Ping flood on loopback interface :
178Running a 2.6.17 vanilla kernel :
179136596 packets transmitted, 136596 packets received, 0% packet loss
180round-trip min/avg/max = 0.0/0.0/0.1 ms
181
182real 0m10.840s
183user 0m0.360s
184sys 0m10.485s
185
18612601 packets transmitted/s
187
188Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
189108504 packets transmitted, 108504 packets received, 0% packet loss
190round-trip min/avg/max = 0.0/0.0/0.1 ms
191
192real 0m8.614s
193user 0m0.264s
194sys 0m8.353s
195
19612596 packets transmitted/s
197
198
199
200Conclusion
201
202In an empty loop, the generic marker is faster than the optimized marker. This
203may be due to better performances of the movzbl instruction over the movb on the
204Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
205of the movzbl becomes greater because it uses the memory bandwidth.
206
207The preemption disabling and call to a probe itself costs 48.11 cycles, almost
208as much as dynamically parsing the format string to get the variable arguments
209(40.48 cycles).
210
211There is almost no difference, on x86, between passing the arguments directly on
212the stack and using a variable argument list when its layout is known
213statically (0.89 cycles vs 2.74 cycles).
214
215
216
This page took 0.030058 seconds and 4 git commands to generate.