Library optimizations and benchmarking
Recommended reading: The Rust performance book
What to optimize
It's preferred to optimize code that shows up as significant in real-world code.
E.g. it's more beneficial to speed up [T]::sort than it is to shave off a small allocation in a function that mostly performs a syscall,
because the latter is dominated by its syscall cost.
Issues about slow library code are labeled I-slow T-libs, and those about code size I-heavy T-libs.
Currently only baseline target features (e.g. SSE2 on x86_64-unknown-linux-gnu) can be used in core and alloc because runtime feature-detection is only available in std. Where possible the preferred way to achieve vectorization is by shaping code in a way that the compiler backend's auto-vectorization passes can understand. This benefits user crates compiled with additional target features when they instantiate generic library functions, e.g. iterators.
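As a rough illustration of "shaping code" for the auto-vectorizer, the sketch below sums floats in fixed-width chunks with one accumulator per lane. The function name `sum_chunked` and the lane count of 8 are assumptions made for this example, not something taken from the standard library. Whether the loop actually gets vectorized depends on the compiler version, target features and optimization level, so the generated assembly should be checked rather than assumed.

```rust
/// Sums a slice of `f32` using several independent accumulators.
///
/// Floating-point addition is not associative, so a single sequential sum
/// generally cannot be reordered by the compiler. Fixed-width chunks with
/// one accumulator per lane give the backend a shape it can map onto SIMD lanes.
pub fn sum_chunked(values: &[f32]) -> f32 {
    // Hypothetical lane count, chosen only for illustration.
    const LANES: usize = 8;

    let mut acc = [0.0f32; LANES];
    let mut chunks = values.chunks_exact(LANES);

    // Every chunk has exactly LANES elements, so the inner loop has a
    // compile-time-known trip count and no bounds checks to get in the way.
    for chunk in &mut chunks {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += *v;
        }
    }

    // Elements that did not fill a whole chunk are summed separately.
    let tail: f32 = chunks.remainder().iter().sum();
    acc.iter().sum::<f32>() + tail
}
```

Because the accumulation order differs from a plain sequential sum, results can differ in the last bits for inputs that are not exactly representable. A change like this is a behavioral change, not just an optimization.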
For parts of the standard library that are heavily used by rustc itself it can be convenient to use the benchmark server.
Since it only measures compile-time but not runtime performance of crates, it can't be used to benchmark features that aren't used by the compiler, e.g. floating point code, linked lists, mpsc channels, etc. For those, explicit benchmarks must be written or extracted from real-world code.
The built-in benchmarks use cargo bench and can be found in the benches directories of core and alloc and in test modules elsewhere in the standard library.
The benchmarks are automatically run in a loop by Bencher::iter to average the runtime over many loop iterations.
For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds.
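The idea of averaging over many iterations can be sketched on stable Rust as follows. This is a hand-rolled stand-in for illustration only. It is not how Bencher::iter is implemented, and `mean_time_per_iter` and `checksum` are hypothetical names introduced for this example.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iterations` times and returns the mean wall-time per iteration.
/// Illustrative only: the real harness chooses iteration counts itself and
/// does more statistical work than a plain mean.
pub fn mean_time_per_iter<T>(iterations: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from discarding the unused result.
        black_box(f());
    }
    start.elapsed() / iterations
}

/// Hypothetical code under test: a small CPU-bound kernel.
pub fn checksum(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(u32::from(b)))
}
```

A call such as `mean_time_per_iter(10_000, || checksum(black_box(&data)))` times the kernel. Passing the input through `black_box` as well keeps the compiler from precomputing the result for a constant input.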
A specific benchmark can be invoked without recompiling rustc via
./x bench library/<lib> --stage 0 --test-args <benchmark name>.
cargo bench measures wall-time. This often is good enough, but small changes such as saving a few instructions
in a bigger function can get drowned out by system noise. In such cases the following changes can make runs more
reproducible:
- disable incremental builds in the bootstrap configuration
- build std and the benchmarks with a single codegen unit, e.g. RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"
- ensure the system is as idle as possible
- disable ASLR
- pin the benchmark process to a specific core
- change the CPU scaling governor to a fixed-frequency one
- disable clock boosts, especially on thermal-limited systems such as laptops
If x or the cargo benchmark harness gets in the way, it can be useful to extract the benchmark into a separate crate,
e.g. to run it under perf stat or cachegrind.
Build the standard library, link stage0-sysroot as a rustup toolchain, and then use that toolchain to build the standalone benchmark against the modified standard library.
If the std rebuild times are too long for fast iteration it can be useful to not only extract the benchmark but also the code under test into a separate crate.
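A standalone crate for tools such as perf stat or cachegrind usually wants a fixed amount of work and no timing code of its own, since the counters come from the external tool. The sketch below shows one possible shape. `count_ascii_digits` stands in for code copied out of the library, and both function names and the input string are hypothetical.

```rust
use std::hint::black_box;

/// Hypothetical code under test, copied out of the library into the
/// standalone crate so it can be rebuilt quickly.
pub fn count_ascii_digits(input: &str) -> usize {
    input.bytes().filter(u8::is_ascii_digit).count()
}

/// Performs a fixed, deterministic amount of work so that instruction
/// counts from an external profiler are comparable between runs.
pub fn run_fixed_workload(iterations: u64) -> u64 {
    // Illustrative input; a real benchmark would use representative data.
    let input = "order 66 shipped 12 of 40 units";
    let mut total = 0u64;
    for _ in 0..iterations {
        // black_box hides the input from the optimizer so the call is not
        // hoisted out of the loop or folded into a constant.
        total += count_ascii_digits(black_box(input)) as u64;
    }
    total
}
```

A `main` that calls `run_fixed_workload` with a constant iteration count and prints the returned total is enough to drive it. Printing the total also ensures the work is observably used.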
Running under perf-record
If extracting the code into a separate crate is impractical, one can first build the benchmark and then run it again
under perf record and then drill down to the benchmark kernel with perf report.
    # 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
    $ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

    # build benchmark without running it
    $ ./x bench --stage 0 library/core/ --test-args skipallbenches

    # run the benchmark under perf
    $ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name>
    $ perf report
By renaming perf.data to keep it from getting overwritten by subsequent runs, it can later be compared to runs with
a modified library.
While perf report shows assembly of the benchmark code, it can sometimes be difficult to get a good overview of what
changed, especially when multiple benchmarks were affected. As an alternative one can extract and diff the assembly
directly from the benchmark suite.
    # 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
    $ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

    # build benchmark libs
    $ ./x bench --stage 0 library/core/ --test-args skipallbenches

    # this should print something like the following
    Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a)

    # get the assembly for all the benchmarks
    $ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \
      build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \
      | rustfilt > baseline.asm

    # switch to the branch with the changes
    $ git switch feature-branch

    # repeat the procedure above
    $ ./x bench ...
    $ objdump ... > changes.asm

    # compare output
    $ kdiff3 baseline.asm changes.asm
This can also be applied to standalone benchmarks.