Adding support for "compact" 32 bits events.

Mathieu Desnoyers
March 9, 2007


event header

32 bits header
Aligned on 32 bits
  1 bit to select event type
  4 bits event ID
    16 events (too few)
  27 bits TSC (cut MSB)
    wraps about 32 times per second at 4GHz
      wraps spaced about 0.03125s apart
    100HZ clock : tick each 0.01s
      must detect wrap at least every 3 jiffies (dangerous, may miss)
    granularity : 2^0 = 1 cycle : 0.25ns @4GHz
  payload size known by facility

32 bits header
Aligned on 32 bits
  1 bit to select event type
  4 bits event ID
    16 events (too few)
  27 bits TSC (cut LSB)
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s
    granularity : 2^5 = 32 cycles : 8ns @4GHz
  payload size known by facility

32 bits header
Aligned on 32 bits
  1 bit to select event type
  5 bits event ID
    32 events
  26 bits TSC (cut LSB)
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s
    granularity : 2^6 = 64 cycles : 16ns @4GHz
  payload size known by facility

32 bits header
Aligned on 32 bits
  1 bit to select event type
  6 bits event ID
    64 events
  25 bits TSC (cut LSB)
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s
    granularity : 2^7 = 128 cycles : 32ns @4GHz
  payload size known by facility

64 bits header
Aligned on 32 bits
  1 bit to select event type
  15 bits event ID (major 8, minor 8)
    32768 events
  16 bits event size (extra)
  32 bits TSC
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s

96 or 128 bits header (full 64 bits TSC, useful when no heartbeat is
available; size depends on internal alignment)
Aligned on 32 bits
  1 bit to select event type
  15 bits event ID (major 8, minor 8)
    32768 events
  16 bits event size (extra)
Align on 64 bits
  64 bits TSC
    wraps each 146.14 years at 4GHz

Must put the event ID fields first in the large (64, 96-128 bits) event
headers.

Create a "compact" facility which reserves the facility IDs with the MSB at 1.
- or better : select mapping for events


What is the minimum granularity required ?
(so we know how many LSBs to cut)

- How much can synchronized CPU TSCs drift apart from one another ?
  PLL
    http://en.wikipedia.org/wiki/Phase-locked_loop
    static phase offset -> tracking jitter
  25 MHz oscillator on motherboard for CPU
  jitter : expressed in ±picoseconds (should therefore be lower than 0.25ns)
  http://www.eetasia.com/ART_8800082274_480600_683c4e6b200103.HTM
  NEED MORE INFO.

- What is the cacheline synchronization latency between the CPUs ?
  Worst case :
  Intel Core 2, Intel Xeon 5100, Intel Core Solo, Intel Core Duo
    Unified L2 cache.
    http://www.intel.com/design/processor/manuals/253668.pdf
  Intel Core 2, Intel Xeon 5100
    http://www.intel.com/design/processor/manuals/253665.pdf
    Up to 10.7 GB/s FSB
  http://www.xbitlabs.com/articles/mobile/display/core2duo_2.html
                        Intel Core Duo    Intel Core 2 Duo
    L2 cache latency    14 cycles         14 cycles
    (round-trip : 28 cycles) 7ns @4GHz
  sparc64 : between threads : shares L1 cache.
    suspected to be ~2 cycles total (1+1) (to check)

- How close (cycle-wise) can two consecutive recorded events be in the same
  buffer ?
  (~200ns, time for logging an event) (~800 cycles @4GHz)

- Tracing code itself : if it is at a subbuffer boundary, there are more
  checks to do. Must see the maximum duration of a non-interrupted probe.
  Worst case (with NMIs enabled) : 6997 cycles. 1749 ns @4GHz.
  TODO : test with NMIs disabled and HT disabled.
  Ordering can be changed if an interrupt comes between the memory operation
  and the tracer call. Therefore, we cannot rely on more precision than the
  expected interrupt handler duration. (guess : ~10000 cycles, 2500ns @4GHz)

- If there is a faster interconnect between the CPUs, it can be a problem,
  but these seem to be only proprietary interconnects, not in general use.
  - IPIs are expected to take much more than 28 cycles.


What is the minimum wrap-around interval ?
(must be safe for a missed timer interrupt, for multiple (configurable)
timer HZ values and for multiple CPU MHz frequencies)

Must align _all_ headers on 32 bits, not 64.
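The cycle/time figures and wrap-around intervals discussed above can be
checked with a few lines of arithmetic. This is only a sketch: the function
names (cycles_to_ns, wrap_period_s, ticks_per_wrap) are made up for
illustration and are not part of any tracer code. It assumes a 4GHz TSC and a
100HZ timer, as in the notes.

```python
CPU_HZ = 4_000_000_000  # 4GHz TSC, the rate assumed throughout these notes


def cycles_to_ns(cycles, cpu_hz=CPU_HZ):
    """Convert a cycle count to nanoseconds at the given TSC rate."""
    return cycles * 1e9 / cpu_hz


def wrap_period_s(tsc_bits, lsb_cut, cpu_hz=CPU_HZ):
    """Seconds before a truncated TSC field wraps.

    tsc_bits : number of TSC bits kept in the header
    lsb_cut  : number of low-order TSC bits dropped (granularity = 2^lsb_cut)
    """
    return (1 << (tsc_bits + lsb_cut)) / cpu_hz


def ticks_per_wrap(tsc_bits, lsb_cut, timer_hz=100, cpu_hz=CPU_HZ):
    """Whole timer ticks that fit in one wrap period (hypothetical helper)."""
    return int(wrap_period_s(tsc_bits, lsb_cut, cpu_hz) * timer_hz)


if __name__ == "__main__":
    print(cycles_to_ns(28))       # L2 round-trip latency
    print(cycles_to_ns(800))      # time to log one event
    print(cycles_to_ns(6997))     # worst-case probe duration
    print(wrap_period_s(27, 0))   # 27 bits TSC, MSB cut, no LSB cut
    print(ticks_per_wrap(27, 0))  # timer ticks available to detect a wrap
    print(wrap_period_s(27, 5))   # 27 bits TSC, 5 LSB cut
```

With only about 3 timer ticks per wrap in the "cut MSB" layout, a single
delayed timer interrupt is enough to miss a wrap, which is why that layout is
flagged as dangerous.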
Granularity : 200ns (800 cycles @4GHz) : 2^9 = 512 (remove 9 LSB)
  Number of LSB to skip : first_bit(probe_duration_in_cycles) - 1

Min wrap : 100HZ system, each 3 timer ticks : 0.03s
  (32-4 MSB for 4GHz : 0.067s)
  (heartbeat each 100HZ, to be safe)
  Number of MSB to skip :
    32 - first_bit((expected_longest_cli()[ms] + max_timer_interval[ms])
                   * 2 * cpu_khz)

9 LSB + 4 MSB = 13 bits total.
12 bits for event IDs : 4096 events.
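The bit budget above can be reproduced with a short sketch. Several things
here are assumptions rather than anything stated in the notes: first_bit() is
taken to behave like the kernel's fls() (1-based index of the highest set
bit), the interval in milliseconds is multiplied by cpu_khz to obtain cycles,
the 20ms/10ms split of the 0.03s interval is hypothetical, and the field
order in pack()/unpack() (select bit, then event ID, then TSC) is only one
possible layout.

```python
def first_bit(x):
    """1-based index of the most significant set bit; 0 when x == 0.

    Assumed to match the kernel's fls().
    """
    return x.bit_length()


def lsb_to_skip(probe_duration_cycles):
    """Low TSC bits that carry no useful precision."""
    return first_bit(probe_duration_cycles) - 1


def msb_to_skip(longest_cli_ms, max_timer_interval_ms, cpu_khz):
    """High TSC bits that can be dropped while still detecting every wrap.

    ms * kHz = cycles; the factor 2 is the safety margin from the notes.
    """
    cycles = (longest_cli_ms + max_timer_interval_ms) * 2 * cpu_khz
    return 32 - first_bit(cycles)


def pack(event_id, tsc, lsb, msb):
    """Build a 32-bit compact header (hypothetical field order)."""
    tsc_bits = 32 - lsb - msb
    id_bits = 32 - 1 - tsc_bits
    assert 0 <= event_id < (1 << id_bits), "event ID does not fit"
    tsc_field = (tsc >> lsb) & ((1 << tsc_bits) - 1)
    # Bit 31 set marks the header as compact (an assumption of this sketch).
    return (1 << 31) | (event_id << tsc_bits) | tsc_field


def unpack(header, lsb, msb):
    """Return (select_bit, event_id, truncated_tsc) from a compact header."""
    tsc_bits = 32 - lsb - msb
    id_bits = 32 - 1 - tsc_bits
    select = header >> 31
    event_id = (header >> tsc_bits) & ((1 << id_bits) - 1)
    return select, event_id, header & ((1 << tsc_bits) - 1)


if __name__ == "__main__":
    lsb = lsb_to_skip(800)                # ~800-cycle probe at 4GHz
    msb = msb_to_skip(20, 10, 4_000_000)  # 0.03s total, 4GHz CPU
    print(lsb, msb, lsb + msb - 1)        # LSB cut, MSB cut, event ID bits
    print(hex(pack(42, 0x12345678, lsb, msb)))
```

The event ID width falls out of the cut bits: of the 13 bits freed, one is
the type-select bit and the remaining 12 encode the event ID.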