Translation Lookaside Buffer (TLB)

Understanding the TLB and the impact it has on application performance
🪦 End of Life. This note is no longer maintained, and is scheduled for deletion.

The translation lookaside buffer (also known as the TLB - not to be confused with Java's Thread-Local Allocation Buffer, the TLAB) is one of the core cache components of a modern CPU, alongside the data and instruction caches. The TLB's role is to help the CPU very quickly translate a virtual memory page address into its location in physical memory. At the hardware level, the TLB typically behaves much like a least-recently-used cache. There are usually multiple levels of TLB, split between instructions (iTLB) and data (dTLB) at the first level but unified at the second level. Each TLB has a limited number of entries, so to extract the highest performance the goal is to keep the hit ratio as high as possible.
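
To get a feel for why the limited entry count matters, the "reach" of a TLB is simply its entry count multiplied by the page size. A minimal sketch of that arithmetic, assuming an illustrative 1,536-entry TLB (not a figure for any particular CPU):

// Illustrative TLB "reach" arithmetic: reach = entry count * page size.
// The 1,536-entry figure is an assumption for illustration, not a measurement
// of any particular CPU.
public class TlbReach {
    public static void main(String[] args) {
        long entries   = 1_536;
        long smallPage = 4L * 1024;          // 4 KiB standard page
        long hugePage  = 2L * 1024 * 1024;   // 2 MiB huge page
        System.out.printf("reach with 4 KiB pages: %d MiB%n",
                entries * smallPage / (1024 * 1024));          // 6 MiB
        System.out.printf("reach with 2 MiB pages: %d GiB%n",
                entries * hugePage / (1024 * 1024 * 1024));    // 3 GiB
    }
}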

When we create a large number of threads in a system (the threshold depends on the hardware and on whether virtualization is used, but think tens to hundreds of threads), the system is likely to suffer severe performance problems, as the costs of lock contention, scheduler overhead, and continually refreshing the TLB and looking up the correct memory page weigh the system down.
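
One common mitigation is to bound concurrency at roughly the hardware's parallelism rather than spawning a thread per task. A minimal sketch, assuming a fixed thread pool sized from availableProcessors() (a heuristic, not a universal rule):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: bound the number of runnable threads to the hardware's parallelism
// rather than creating a thread per unit of work. Fewer runnable threads means
// fewer context switches and fewer distinct working sets competing for the
// same limited TLB entries.
public class BoundedConcurrency {

    public static void main(String[] args) throws InterruptedException {
        int poolSize = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int task = 0; task < 10_000; task++) {
            final int id = task;
            pool.execute(() -> doWork(id));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static void doWork(int taskId) {
        // placeholder for the real per-task work
    }
}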

If we instead build the system in a TLB-friendly manner, we can extract very high performance. As with all performance claims, this should be properly validated in the context of your own environment.

To be TLB-friendly, try to:

  • stride sequentially through your application's memory (forwards or backwards, CPUs don't mind which) and avoid random access. Cache prefetching (which typically happens at a mix of the hardware and software layers) works best with sequentially accessed memory; see the sketch after this list.
  • use threads with care
  • consider your object layout in memory. Standard POJOs stored in a HashMap or similar will often involve random access. Explicitly managing your memory via off-heap structures can result in a much more predictable memory layout, and thus higher and/or more stable performance (the sketch below contrasts the two approaches).
  • if you're using virtualized hardware, being TLB-friendly can significantly help application performance
  • if you're running on Linux, take a look at Transparent Huge Pages (on HotSpot, larger pages can be enabled with flags such as -XX:+UseLargePages or -XX:+UseTransparentHugePages)
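
As a rough illustration of the access-pattern and layout points above, the sketch below walks fixed-size records packed into a single off-heap buffer sequentially, and contrasts that with random gets against POJOs held in a HashMap. The Quote record, its fields and the nanoTime() timing are invented purely for illustration; use JMH (described below) for any real measurement.

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Rough sketch contrasting two layouts for the same data: boxed POJOs scattered
// across the heap and reached via a HashMap, versus fixed-size records packed
// back-to-back in a single off-heap buffer and walked sequentially.
public class MemoryLayoutSketch {

    static final int RECORDS = 1_000_000;
    static final int RECORD_BYTES = Long.BYTES + Double.BYTES;

    static final class Quote {          // the "standard POJO" variant
        long id;
        double price;
    }

    public static void main(String[] args) {
        Random random = new Random(42);

        // Variant 1: POJOs in a HashMap; each lookup chases references to
        // objects that may live on many different pages.
        Map<Long, Quote> byId = new HashMap<>();
        for (long i = 0; i < RECORDS; i++) {
            Quote q = new Quote();
            q.id = i;
            q.price = random.nextDouble();
            byId.put(i, q);
        }

        // Variant 2: the same records packed into one off-heap buffer; a
        // sequential walk touches each page once, in order.
        ByteBuffer flat = ByteBuffer.allocateDirect(RECORDS * RECORD_BYTES);
        for (int i = 0; i < RECORDS; i++) {
            flat.putLong((long) i).putDouble(random.nextDouble());
        }

        double sum = 0;
        flat.position(0);
        long start = System.nanoTime();
        for (int i = 0; i < RECORDS; i++) {
            flat.getLong();               // skip the id
            sum += flat.getDouble();      // sequential, page-friendly access
        }
        System.out.printf("flat walk:   %,d ns (sum=%.2f)%n", System.nanoTime() - start, sum);

        sum = 0;
        start = System.nanoTime();
        for (int i = 0; i < RECORDS; i++) {
            long key = random.nextInt(RECORDS);   // random, reference-chasing access
            sum += byId.get(key).price;
        }
        System.out.printf("random gets: %,d ns (sum=%.2f)%n", System.nanoTime() - start, sum);
    }
}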

One option for measuring Java code's TLB friendliness is Linux perf. This can be integrated into JMH via the LinuxPerfNormProfiler. To make use of it, you will need to run on a physical Linux machine with perf installed.
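
A minimal sketch of wiring the profiler into a run programmatically, assuming JMH and its annotation processor are on the build path. The placeholderWork benchmark is invented for illustration, and the same profiler can be requested from the command line with -prof perfnorm:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Sketch: a trivial benchmark with the perf-based profiler attached
// programmatically. The benchmark body is a placeholder for the code you
// actually want to measure.
public class ProfiledBenchmark {

    @Benchmark
    public double placeholderWork() {
        return Math.sqrt(42.0);    // stand-in for real work
    }

    public static void main(String[] args) throws Exception {
        Options options = new OptionsBuilder()
                .include(ProfiledBenchmark.class.getSimpleName())
                .addProfiler(LinuxPerfNormProfiler.class)
                .build();
        new Runner(options).run();
    }
}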

Sample output from the profiler, with both dTLB and iTLB data. Note that the dTLB sees roughly 79 loads per operation but only on the order of 10⁻⁴ misses per operation, i.e. an effectively perfect hit ratio:

Benchmark                                        Mode      Cnt       Score   Error      Units
createRfqRoundtrip:CPI                         sample     2000       0.245          clks/insn
createRfqRoundtrip:IPC                         sample     2000       4.078          insns/clk
createRfqRoundtrip:L1-dcache-load-misses       sample     2000       0.005               #/op
createRfqRoundtrip:L1-dcache-loads             sample     2000      79.518               #/op
createRfqRoundtrip:L1-dcache-stores            sample     2000      29.214               #/op
createRfqRoundtrip:L1-icache-load-misses       sample     2000       0.008               #/op
createRfqRoundtrip:LLC-load-misses             sample     2000      ≈ 10⁻³               #/op
createRfqRoundtrip:LLC-loads                   sample     2000       0.002               #/op
createRfqRoundtrip:LLC-store-misses            sample     2000      ≈ 10⁻⁴               #/op
createRfqRoundtrip:LLC-stores                  sample     2000      ≈ 10⁻³               #/op
createRfqRoundtrip:branch-misses               sample     2000       0.008               #/op
createRfqRoundtrip:branches                    sample     2000      44.329               #/op
createRfqRoundtrip:cycles                      sample     2000      65.693               #/op
createRfqRoundtrip:dTLB-load-misses            sample     2000      ≈ 10⁻⁴               #/op
createRfqRoundtrip:dTLB-loads                  sample     2000      79.381               #/op
createRfqRoundtrip:dTLB-store-misses           sample     2000      ≈ 10⁻⁴               #/op
createRfqRoundtrip:dTLB-stores                 sample     2000      29.191               #/op
createRfqRoundtrip:iTLB-load-misses            sample     2000      ≈ 10⁻⁴               #/op
createRfqRoundtrip:iTLB-loads                  sample     2000       0.001               #/op
createRfqRoundtrip:instructions                sample     2000     267.926               #/op


References

Benjamin J. Evans, James Gough, and Chris Newland: Optimizing Java, 2018. Published by O’Reilly Media, Inc.
Brendan Gregg: Systems Performance, 2nd Edition, 2021. Published by Pearson Education.
Scott Oaks: Java Performance, 2nd Edition, 2020. Published by O’Reilly Media, Inc.
Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood: A Primer on Memory Consistency and Cache Coherence, 2nd Edition, 2020. Published by Morgan & Claypool.
Brendan Gregg: perf Examples.

Change log

  • Added 10 December 2020
  • Updated 13 December 2020 - added Brendan Gregg's Systems Performance book and links to perf Examples + Martin Thompson's Memory access patterns are important
  • Updated 10 March 2021 - added sample JMH LinuxPerfNormProfiler output.

Metadata

  • reading time: 3 min read
  • published: 2020-12-10
  • last updated: 2021-03-10
  • importance: low
  • review policy: continuous
  • topics: Hardware
