The translation lookaside buffer (TLB, not to be confused with Java's thread-local allocation buffer, the TLAB) is one of the core cache components of a modern CPU, along with the data and instruction caches. The TLB's role is to help the CPU very quickly look up the physical memory address of a virtual memory page. At the hardware level, the TLB typically behaves much like a least-recently-used cache. There are usually multiple levels of TLB, which are sometimes split into instruction (iTLB) and data (dTLB) buffers at the first level, but tend to be unified at the second level and beyond. Each TLB has a limited number of entries, so to extract the highest performance, the goal is to keep the hit ratio as high as possible.
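To make the link between a limited number of entries and the hit ratio concrete, here is a minimal sketch that models a TLB as a small LRU cache of page numbers. It is a toy model, not a description of real hardware: the class name `TlbSketch`, the 64-entry capacity and the 4 KiB page size are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Toy model of a TLB as a small LRU cache of page numbers (not real hardware behaviour). */
class TlbSketch {
    static final int PAGE_SIZE = 4096; // assumed 4 KiB pages
    static final int TLB_ENTRIES = 64; // assumed capacity, purely illustrative

    /** Replays byte addresses through an LRU cache of page numbers and returns the hit ratio. */
    static double hitRatio(long[] addresses) {
        // accessOrder=true keeps entries ordered from least to most recently used
        Map<Long, Boolean> tlb = new LinkedHashMap<Long, Boolean>(TLB_ENTRIES, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > TLB_ENTRIES; // evict the least recently used page
            }
        };
        long hits = 0;
        for (long address : addresses) {
            long page = address / PAGE_SIZE;
            if (tlb.get(page) != null) {
                hits++;                      // translation already cached
            } else {
                tlb.put(page, Boolean.TRUE); // miss: stands in for a page-table walk
            }
        }
        return (double) hits / addresses.length;
    }

    public static void main(String[] args) {
        int count = 1_000_000;
        long span = 1L << 30; // 1 GiB of address space
        long[] sequential = new long[count];
        long[] random = new long[count];
        Random rnd = new Random(42);
        for (int i = 0; i < count; i++) {
            sequential[i] = i * 8L;                       // consecutive 8-byte values
            random[i] = (long) (rnd.nextDouble() * span); // scattered across the span
        }
        System.out.printf("sequential hit ratio: %.4f%n", hitRatio(sequential));
        System.out.printf("random hit ratio:     %.4f%n", hitRatio(random));
    }
}
```

In this model the sequential stream misses only once per page, so its hit ratio is close to 1, while the scattered stream almost never finds its page among the 64 cached entries.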
When we create a large number of threads in a system (how many depends on the hardware and on whether virtualization is used, but think tens to hundreds), the system is likely to suffer severe performance problems, as the costs of lock contention, scheduler overhead, and repeatedly refilling the TLB and looking up the correct memory page weigh it down.
If we instead build the system in a TLB-friendly manner, we can extract very high performance. As with all performance claims, this should be properly validated within the context of your environment.
To be TLB-friendly, try to:
- stride sequentially through your application's memory (forwards or backwards, CPUs don't mind which). Avoid random access. Cache prefetching (which is typically done by a mix of hardware and software) works best with sequentially accessed memory.
- use threads with care
- consider your object layout in memory. Standard POJOs stored in a HashMap or similar may well involve random access. Explicitly managing your memory via off-heap structures can result in a much more predictable memory layout, and thus higher and/or more stable performance.
- pay particular attention if you're running on virtualized hardware, where being TLB-friendly can significantly help application performance
- if you're running on Linux, take a look at Transparent Huge Pages, which let each TLB entry cover a larger region of memory
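The sequential-versus-random advice in the list above can be sketched as follows. This is an illustrative example only: the class `AccessPatternSketch` and its method names are hypothetical, and the array size is an assumption picked to be larger than a TLB with a few thousand 4 KiB entries is likely to cover. Both methods do the same work over the same array; only the order of access differs.

```java
import java.util.Random;

/** Sketch: identical work, two access patterns over the same array. */
class AccessPatternSketch {

    /** Visits every element in index order, the pattern that prefetchers and the TLB favour. */
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** Visits elements in the order given by an index array (here, a shuffled permutation). */
    static long sumInOrder(long[] data, int[] order) {
        long sum = 0;
        for (int index : order) {
            sum += data[index];
        }
        return sum;
    }

    /** Fisher-Yates shuffle of 0..n-1, so the shuffled walk touches each element exactly once. */
    static int[] shuffledIndices(int n, long seed) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }

    public static void main(String[] args) {
        int n = 1 << 22; // 4M longs = 32 MiB (assumed size, for illustration)
        long[] data = new long[n];
        for (int i = 0; i < n; i++) {
            data[i] = i;
        }
        int[] order = shuffledIndices(n, 42L);

        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumInOrder(data, order);
        long t2 = System.nanoTime();

        // Single-shot timings are only indicative; use a harness such as JMH for real measurements.
        System.out.printf("sequential: sum=%d, %d ms%n", a, (t1 - t0) / 1_000_000);
        System.out.printf("shuffled:   sum=%d, %d ms%n", b, (t2 - t1) / 1_000_000);
    }
}
```

Both sums are identical, so any timing difference comes from the access pattern alone. Treat the printed timings as a rough hint rather than a measurement, since they include JIT warm-up and cache effects as well as TLB behaviour.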
One way to measure how TLB-friendly your Java code is, is to use Linux perf. It can be integrated into JMH via the LinuxPerfNormProfiler (enabled with `-prof perfnorm` on the JMH command line). To make use of this, you will need to run on a physical Linux machine with perf installed.
Sample outputs from the profiler, with both dTLB and iTLB data:
    Benchmark                                   Mode   Cnt     Score  Error  Units
    createRfqRoundtrip:CPI                    sample  2000     0.245         clks/insn
    createRfqRoundtrip:IPC                    sample  2000     4.078         insns/clk
    createRfqRoundtrip:L1-dcache-load-misses  sample  2000     0.005         #/op
    createRfqRoundtrip:L1-dcache-loads        sample  2000    79.518         #/op
    createRfqRoundtrip:L1-dcache-stores       sample  2000    29.214         #/op
    createRfqRoundtrip:L1-icache-load-misses  sample  2000     0.008         #/op
    createRfqRoundtrip:LLC-load-misses        sample  2000    ≈ 10⁻³         #/op
    createRfqRoundtrip:LLC-loads              sample  2000     0.002         #/op
    createRfqRoundtrip:LLC-store-misses       sample  2000    ≈ 10⁻⁴         #/op
    createRfqRoundtrip:LLC-stores             sample  2000    ≈ 10⁻³         #/op
    createRfqRoundtrip:branch-misses          sample  2000     0.008         #/op
    createRfqRoundtrip:branches               sample  2000    44.329         #/op
    createRfqRoundtrip:cycles                 sample  2000    65.693         #/op
    createRfqRoundtrip:dTLB-load-misses       sample  2000    ≈ 10⁻⁴         #/op
    createRfqRoundtrip:dTLB-loads             sample  2000    79.381         #/op
    createRfqRoundtrip:dTLB-store-misses      sample  2000    ≈ 10⁻⁴         #/op
    createRfqRoundtrip:dTLB-stores            sample  2000    29.191         #/op
    createRfqRoundtrip:iTLB-load-misses       sample  2000    ≈ 10⁻⁴         #/op
    createRfqRoundtrip:iTLB-loads             sample  2000     0.001         #/op
    createRfqRoundtrip:instructions           sample  2000   267.926         #/op
- Added 10 December 2020
- Updated 13 December 2020 - added Brendan Gregg's Systems Performance book and links to perf Examples + Martin Thompson's Memory access patterns are important
- Updated 10 March 2021 - added sample JMH LinuxPerfNormProfiler output.