tests/markers/markers-microbench-0.2.txt

   1
   2
   3 * Microbenchmarks
   4
   5 Use timestamp counter to calculate the time spent, with interrupts disabled.
   6 Machine : Pentium 4 3GHz
   7 Fully preemptible kernel
   8 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
   9 Linux Kernel Markers 0.19
  10
  11 * Execute an empty loop
  12 NR_LOOPS : 10000000
  13 time delta (cycles): 15026497
  14 cycles per loop : 1.50
  15 - i386 "optimized" : immediate value, test and predicted branch
  16   (non connected marker)
  17 NR_LOOPS : 10000000
  18 time delta (cycles): 40031640
  19 cycles per loop : 4.00
  20 cycles per loop for marker : 2.50
  21 - i386 "generic" : load, test and predicted branch
  22   (non connected marker)
  23 NR_LOOPS : 10000000
  24 time delta (cycles): 26697878
  25 cycles per loop : 2.67
  26 cycles per loop for marker : 1.17
  27
  28 * Execute a loop of memcpy 4096 bytes
  29 - Without marker
  30 NR_LOOPS : 10000
  31 time delta (cycles): 12981555
  32 cycles per loop : 1298.16
  33 - i386 "optimized" : immediate value, test and predicted branch
  34   (non connected marker)
  35 NR_LOOPS : 10000
  36 time delta (cycles): 12982290
  37 cycles per loop : 1298.23
  38 cycles per loop for marker : 0.074
  39 - i386 "generic" : load, test and predicted branch
  40   (non connected marker)
  41 NR_LOOPS : 10000
  42 time delta (cycles): 13002788
  43 cycles per loop : 1300.28
  44 cycles per loop for marker : 2.123
  45
  46
  47 The following tests are done with the "optimized" markers only
  48
  49 Execute a loop with marker enabled, with i386 "fastcall" register argument
  50         setup, probe empty.
  51 NR_LOOPS : 100000
  52 time delta (cycles): 4407608
  53 cycles per loop : 44.08
  54 cycles per loop to setup arguments in registers : 44.08-4.00=40.08
  55
  56 Execute a loop with a marker enabled, with an empty probe. Var args argument
  57         setup, probe empty.
  58 NR_LOOPS : 100000
  59 time delta (cycles): 5210587
  60 cycles per loop : 52.11
  61 additional cycles per loop to setup var args : 52.11-44.08=8.03
  62
  63 Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
  64         Data is copied by the probe.
  65 NR_LOOPS : 100000
  66 time delta (cycles): 5299837
  67 cycles per loop : 53.00
  68 additional cycles per loop to get arguments in probe (from stack) on x86 :
  69   53.00-52.11=0.89
  70
  71 Execute a loop with marker enabled, with var args probe expecting arguments.
  72         Data is copied by the probe.
  73 NR_LOOPS : 100000
  74 time delta (cycles): 5574300
  75 cycles per loop : 55.74
  76 additional cycles per loop to get expected variable arguments on x86 :
  77   55.74-53.00=2.74
  78
  79 Execute a loop with marker enabled, with var args probe, format string
  80         Data is copied by the probe.
  81 processing.
  82 NR_LOOPS : 100000
  83 time delta (cycles): 9622117
  84 cycles per loop : 96.22
  85 additional cycles per loop to dynamically parse arguments with format string :
  86   96.22-55.74=40.48
  87
  88
  89 * Assembly code
  90
  91
  92 - Optimized
  93
  94 static int my_open(struct inode *inode, struct file *file)
  95 {
  96    0:   55                      push   %ebp
  97    1:   89 e5                   mov    %esp,%ebp
  98    3:   83 ec 0c                sub    $0xc,%esp
  99         MARK(subsys_mark1, "%d %p", 1, NULL);
 100    6:   b0 00                   mov    $0x0,%al
 101    8:   84 c0                   test   %al,%al
 102    a:   75 07                   jne    13 <my_open+0x13>
 103
 104         return -EPERM;
 105 }
 106    c:   b8 ff ff ff ff          mov    $0xffffffff,%eax
 107   11:   c9                      leave
 108   12:   c3                      ret
 109   13:   b8 01 00 00 00          mov    $0x1,%eax
 110   18:   e8 fc ff ff ff          call   19 <my_open+0x19>
 111   1d:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
 112   24:   00
 113   25:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
 114   2c:   00
 115   2d:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
 116   34:   ff 15 74 10 00 00       call   *0x1074
 117   3a:   b8 01 00 00 00          mov    $0x1,%eax
 118   3f:   e8 fc ff ff ff          call   40 <my_open+0x40>
 119   44:   eb c6                   jmp    c <my_open+0xc>
 120
 121
 122 - Generic
 123
 124 static int my_open(struct inode *inode, struct file *file)
 125 {
 126    0:   55                      push   %ebp
 127    1:   89 e5                   mov    %esp,%ebp
 128    3:   83 ec 0c                sub    $0xc,%esp
 129         MARK(subsys_mark1, "%d %p", 1, NULL);
 130    6:   0f b6 05 20 10 00 00    movzbl 0x1020,%eax
 131    d:   84 c0                   test   %al,%al
 132    f:   75 07                   jne    18 <my_open+0x18>
 133
 134         return -EPERM;
 135 }
 136   11:   b8 ff ff ff ff          mov    $0xffffffff,%eax
 137   16:   c9                      leave
 138   17:   c3                      ret
 139   18:   b8 01 00 00 00          mov    $0x1,%eax
 140   1d:   e8 fc ff ff ff          call   1e <my_open+0x1e>
 141   22:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
 142   29:   00
 143   2a:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
 144   31:   00
 145   32:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
 146   39:   ff 15 74 10 00 00       call   *0x1074
 147   3f:   b8 01 00 00 00          mov    $0x1,%eax
 148   44:   e8 fc ff ff ff          call   45 <my_open+0x45>
 149   49:   eb c6                   jmp    11 <my_open+0x11>
 150
 151 * Size (x86)
 152
 153 - Optimized
 154
 155 Adds 6 bytes in the "likely" path.
 156 Adds 32 bytes in the "unlikely" path.
 157
 158 - Generic
 159
 160 Adds 11 bytes in the "likely" path.
 161 Adds 32 bytes in the "unlikely" path.
 162
 163
 164 * Macrobenchmarks
 165
 166 Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
 167 Running a 2.6.17 vanilla kernel :
 168 real    8m2.443s
 169 user    7m35.124s
 170 sys     0m34.950s
 171
 172 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
 173 real    8m1.635s
 174 user    7m34.552s
 175 sys     0m36.298s
 176
 177 Ping flood on loopback interface :
 178 Running a 2.6.17 vanilla kernel :
 179 136596 packets transmitted, 136596 packets received, 0% packet loss
 180 round-trip min/avg/max = 0.0/0.0/0.1 ms
 181
 182 real    0m10.840s
 183 user    0m0.360s
 184 sys     0m10.485s
 185
 186 12601 packets transmitted/s
 187
 188 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
 189 108504 packets transmitted, 108504 packets received, 0% packet loss
 190 round-trip min/avg/max = 0.0/0.0/0.1 ms
 191
 192 real    0m8.614s
 193 user    0m0.264s
 194 sys     0m8.353s
 195
 196 12596 packets transmitted/s
 197
 198
 199
 200 Conclusion
 201
 202 In an empty loop, the generic marker is faster than the optimized marker. This
 203 may be due to better performances of the movzbl instruction over the movb on the
 204 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
 205 of the movzbl becomes greater because it uses the memory bandwidth.
 206
 207 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
 208 as much as dynamically parsing the format string to get the variable arguments
 209 (40.48 cycles).
 210
 211 There is almost no difference, on x86, between passing the arguments directly on
 212 the stack and using a variable argument list when its layout is known
 213 statically (0.89 cycles vs 2.74 cycles).
 214
 215
 216