Library optimizations and benchmarking

Recommended reading: The Rust Performance Book

What to optimize

It's preferable to optimize code that shows up as significant in real-world workloads. E.g. it's more beneficial to speed up [T]::sort than to shave a small allocation off Command::spawn, because the latter is dominated by its syscall cost.

Issues about slow library code are labeled I-slow T-libs, and those about code size are labeled I-heavy T-libs.

Vectorization

Currently only baseline target features (e.g. SSE2 on x86_64-unknown-linux-gnu) can be used in core and alloc because runtime feature-detection is only available in std. Where possible, the preferred way to achieve vectorization is to shape the code so that the compiler backend's auto-vectorization passes can understand it. This also benefits user crates compiled with additional target features when they instantiate generic library functions, e.g. iterators.
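As an illustration, here is a hypothetical reduction (not a function from the standard library) written in a shape that LLVM's auto-vectorizer handles well: a fixed-width inner loop over chunks plus a scalar tail.

// Hypothetical example: processing fixed-size chunks lets the backend
// turn the inner loop into SIMD operations and keep independent
// accumulators in vector lanes.
pub fn sum_u32(data: &[u32]) -> u32 {
    let mut chunks = data.chunks_exact(8);
    let mut lanes = [0u32; 8];
    for chunk in &mut chunks {
        for i in 0..8 {
            lanes[i] = lanes[i].wrapping_add(chunk[i]);
        }
    }
    // Scalar tail for the remaining len % 8 elements.
    let mut acc = lanes.iter().fold(0u32, |a, &x| a.wrapping_add(x));
    for &x in chunks.remainder() {
        acc = acc.wrapping_add(x);
    }
    acc
}

Note that a precompiled, non-generic function in core stays limited to the baseline features; the wider vectors only materialize when the function is generic or inlinable (e.g. #[inline]) and thus gets codegened in a user crate built with additional target features.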

rustc-perf

For parts of the standard library that are heavily used by rustc itself, it can be convenient to use the rustc-perf benchmark server.

Since it only measures the compile-time impact of changes, not the runtime performance of the compiled crates, it can't be used to benchmark features that the compiler itself doesn't exercise, e.g. floating-point code, linked lists, or mpsc channels. For those, explicit benchmarks must be written or extracted from real-world code.

Built-in microbenchmarks

The built-in benchmarks use cargo bench and can be found in the benches directories of core and alloc and in test modules in std.

The benchmarks are automatically run in a loop by Bencher::iter to average the runtime over many iterations. For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds.
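A minimal benchmark of that shape could look as follows; the benchmarked operation and values here are made up for illustration.

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn bench_saturating_add(b: &mut Bencher) {
    // black_box hides the values from the optimizer so the operation
    // isn't constant-folded away.
    let x = black_box(u32::MAX - 1);
    // Bencher::iter runs the closure in a loop and reports the average
    // runtime of a single iteration.
    b.iter(|| black_box(x).saturating_add(black_box(1)));
}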

A specific benchmark can be invoked without recompiling rustc via ./x bench library/<lib> --stage 0 --test-args <benchmark name>.

cargo bench measures wall-time. This is often good enough, but small changes, such as saving a few instructions in a larger function, can get drowned out by system noise. In such cases the following steps can make runs more reproducible:

  • disable incremental builds in config.toml
  • build std and the benchmarks with RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"
  • ensure the system is as idle as possible
  • disable ASLR (on Linux, e.g. via /proc/sys/kernel/randomize_va_space)
  • pin the benchmark process to a specific core (e.g. with taskset)
  • set the CPU scaling governor to a fixed-frequency one (performance or powersave, e.g. with cpupower)
  • disable clock boosts, especially on thermally limited systems such as laptops

Standalone tests

If x or the cargo benchmark harness gets in the way, it can be useful to extract the benchmark into a separate crate, e.g. to run it under perf stat or cachegrind.

Build the standard library, link the stage0-sysroot as a rustup toolchain, and then use that toolchain to build the standalone benchmark against the modified standard library.

If the std rebuild times are too long for fast iteration, it can be useful to extract not only the benchmark but also the code under test into a separate crate.
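As a sketch, the standalone crate's main.rs can time a hot loop directly; the kernel below is a placeholder for the actual code under test.

use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data: Vec<u32> = (0..4096).collect();
    let iterations = 1_000_000u32;

    let start = Instant::now();
    let mut acc = 0u32;
    for _ in 0..iterations {
        // black_box keeps inputs and results from being optimized away
        // across iterations.
        acc = acc.wrapping_add(black_box(&data).iter().sum::<u32>());
    }
    println!("acc: {}, {:?} per iteration", black_box(acc), start.elapsed() / iterations);
}

Assuming the toolchain was linked under the name stage0, cargo +stage0 build --release builds the crate against the modified std, and the resulting binary can then be run under perf stat or valgrind --tool=cachegrind without any harness overhead.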

Running under perf-record

If extracting the code into a separate crate is impractical, one can first build the benchmark without running it, then run it under perf record, and drill down to the benchmark kernel with perf report.

# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

# build benchmark without running it
$ ./x bench --stage 0 library/core/ --test-args skipallbenches

# run the benchmark under perf
$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name>
$ perf report

Rename perf.data to keep it from getting overwritten by subsequent runs; it can later be compared against runs with a modified library via perf diff.

Comparing assembly

While perf report shows the assembly of the benchmark code, it can sometimes be difficult to get a good overview of what changed, especially when multiple benchmarks were affected. As an alternative, one can extract and diff the assembly directly from the benchmark suite.

# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

# build benchmark libs
$ ./x bench --stage 0 library/core/ --test-args skipallbenches

# this should print something like the following
Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a)

# get the assembly for all the benchmarks
$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \
  build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \
  | rustfilt > baseline.asm

# switch to the branch with the changes
$ git switch feature-branch

# repeat the procedure above
$ ./x bench ...
$ objdump ... > changes.asm

# compare output
$ kdiff3 baseline.asm changes.asm

This can also be applied to standalone benchmarks.