bc82195a |
1 | |
2 | |
3 | * Microbenchmarks |
4 | |
5 | Use timestamp counter to calculate the time spent, with interrupts disabled. |
6 | Machine : Pentium 4 3GHz |
7 | Fully preemptible kernel |
8 | marker : MARK(subsys_mark1, "%d %p", 1, NULL); |
9 | Linux Kernel Markers 0.19 |
10 | |
11 | * Execute an empty loop |
12 | NR_LOOPS : 10000000 |
13 | time delta (cycles): 15026497 |
14 | cycles per loop : 1.50 |
15 | - i386 "optimized" : immediate value, test and predicted branch |
16 | (non connected marker) |
17 | NR_LOOPS : 10000000 |
18 | time delta (cycles): 40031640 |
19 | cycles per loop : 4.00 |
20 | cycles per loop for marker : 2.50 |
21 | - i386 "generic" : load, test and predicted branch |
22 | (non connected marker) |
23 | NR_LOOPS : 10000000 |
24 | time delta (cycles): 26697878 |
25 | cycles per loop : 2.67 |
26 | cycles per loop for marker : 1.17 |
27 | |
28 | * Execute a loop of memcpy 4096 bytes |
29 | - Without marker |
30 | NR_LOOPS : 10000 |
31 | time delta (cycles): 12981555 |
32 | cycles per loop : 1298.16 |
33 | - i386 "optimized" : immediate value, test and predicted branch |
34 | (non connected marker) |
35 | NR_LOOPS : 10000 |
36 | time delta (cycles): 12982290 |
37 | cycles per loop : 1298.23 |
38 | cycles per loop for marker : 0.074 |
39 | - i386 "generic" : load, test and predicted branch |
40 | (non connected marker) |
41 | NR_LOOPS : 10000 |
42 | time delta (cycles): 13002788 |
43 | cycles per loop : 1300.28 |
44 | cycles per loop for marker : 2.123 |
45 | |
46 | |
47 | The following tests are done with the "optimized" markers only |
48 | |
49 | Execute a loop with a marker enabled, with an empty probe. |
50 | NR_LOOPS : 100000 |
51 | time delta (cycles): 5210587 |
52 | cycles per loop : 52.11 |
53 | cycles per loop for empty probe : 52.11-4.00=48.11 |
54 | |
55 | Execute a loop with marker enabled, with i386 direct argument passing. |
56 | NR_LOOPS : 100000 |
57 | time delta (cycles): 5299837 |
58 | cycles per loop : 53.00 |
59 | cycles per loop to get arguments in probe (from stack) on x86 : 53.00-52.11=0.89 |
60 | |
61 | Execute a loop with marker enabled, with var args probe. |
62 | NR_LOOPS : 100000 |
63 | time delta (cycles): 5574300 |
64 | cycles per loop : 55.74 |
65 | cycles per loop to get expected variable arguments on x86 : 55.74-53.00=2.74 |
66 | |
67 | Execute a loop with marker enabled, with var args probe, format string |
68 | processing. |
69 | NR_LOOPS : 100000 |
70 | time delta (cycles): 9622117 |
71 | cycles per loop : 96.22 |
72 | cycles per loop to dynamically parse arguments |
73 | with format string : 96.22-55.74=40.48 |
74 | |
75 | |
76 | * Assembly code |
77 | |
78 | |
79 | - Optimized |
80 | |
81 | static int my_open(struct inode *inode, struct file *file) |
82 | { |
83 | 0: 55 push %ebp |
84 | 1: 89 e5 mov %esp,%ebp |
85 | 3: 83 ec 0c sub $0xc,%esp |
86 | MARK(subsys_mark1, "%d %p", 1, NULL); |
87 | 6: b0 00 mov $0x0,%al |
88 | 8: 84 c0 test %al,%al |
89 | a: 75 07 jne 13 <my_open+0x13> |
90 | |
91 | return -EPERM; |
92 | } |
93 | c: b8 ff ff ff ff mov $0xffffffff,%eax |
94 | 11: c9 leave |
95 | 12: c3 ret |
96 | 13: b8 01 00 00 00 mov $0x1,%eax |
97 | 18: e8 fc ff ff ff call 19 <my_open+0x19> |
98 | 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
99 | 24: 00 |
100 | 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
101 | 2c: 00 |
102 | 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
103 | 34: ff 15 74 10 00 00 call *0x1074 |
104 | 3a: b8 01 00 00 00 mov $0x1,%eax |
105 | 3f: e8 fc ff ff ff call 40 <my_open+0x40> |
106 | 44: eb c6 jmp c <my_open+0xc> |
107 | |
108 | |
109 | - Generic |
110 | |
111 | static int my_open(struct inode *inode, struct file *file) |
112 | { |
113 | 0: 55 push %ebp |
114 | 1: 89 e5 mov %esp,%ebp |
115 | 3: 83 ec 0c sub $0xc,%esp |
116 | MARK(subsys_mark1, "%d %p", 1, NULL); |
117 | 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax |
118 | d: 84 c0 test %al,%al |
119 | f: 75 07 jne 18 <my_open+0x18> |
120 | |
121 | return -EPERM; |
122 | } |
123 | 11: b8 ff ff ff ff mov $0xffffffff,%eax |
124 | 16: c9 leave |
125 | 17: c3 ret |
126 | 18: b8 01 00 00 00 mov $0x1,%eax |
127 | 1d: e8 fc ff ff ff call 1e <my_open+0x1e> |
128 | 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
129 | 29: 00 |
130 | 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
131 | 31: 00 |
132 | 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
133 | 39: ff 15 74 10 00 00 call *0x1074 |
134 | 3f: b8 01 00 00 00 mov $0x1,%eax |
135 | 44: e8 fc ff ff ff call 45 <my_open+0x45> |
136 | 49: eb c6 jmp 11 <my_open+0x11> |
137 | |
138 | * Size (x86) |
139 | |
140 | - Optimized |
141 | |
142 | Adds 6 bytes in the "likely" path. |
143 | Adds 32 bytes in the "unlikely" path. |
144 | |
145 | - Generic |
146 | |
147 | Adds 11 bytes in the "likely" path. |
148 | Adds 32 bytes in the "unlikely" path. |
149 | |
150 | |
151 | |
152 | Conclusion |
153 | |
154 | In an empty loop, the generic marker is faster than the optimized marker. This |
155 | may be due to better performances of the movzbl instruction over the movb on the |
156 | Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact |
157 | of the movzbl becomes greater because it uses the memory bandwidth. |
158 | |
159 | The preemption disabling and call to a probe itself costs 48.11 cycles, almost |
160 | as much as dynamically parsing the format string to get the variable arguments |
161 | (40.48 cycles). |
162 | |
163 | There is almost no difference, on x86, between passing the arguments directly on |
164 | the stack and using a variable argument list when its layout is known |
165 | statically (0.89 cycles vs 2.74 cycles). |
166 | |
167 | |
168 | |