The Rust Performance Book
First published in November 2020
Written by Nicholas Nethercote and the community
Introduction

Performance is important for many Rust programs.

This book contains techniques that can improve the performance (speed and memory usage) of Rust programs. The Compile Times section also contains some techniques that will improve the compile times of Rust programs. Some techniques only require changing build configurations, but many require changing code.

Some of the techniques are entirely Rust-specific, while others embody ideas that can be applied to programs written in other languages. The General Tips section covers general principles that in theory apply to any programming language. However, this book is mostly about the performance of Rust programs and is not a general guide to profiling and optimization.

This book also focuses on techniques that are proven in practice: it usually links to pull requests or other resources that show how the technique was used in a real Rust program.

This book is aimed at intermediate and advanced Rust users. Beginner Rust users have more than enough to learn already, and these techniques are likely to be an unhelpful distraction to them.
Benchmarking

Benchmarking typically involves comparing the performance of two or more programs that do the same thing. Sometimes this means comparing different programs, e.g. Firefox vs Chrome vs Safari. Sometimes it means comparing different versions of the same program, which lets us answer the question "did this change make the program faster?".

Benchmarking is a complex topic and a thorough coverage is beyond the scope of this book, but here are the basics.

First, you need workloads to measure. Ideally, you would have a variety of workloads that represent the realistic usage of your program. Workloads using real-world inputs are best, but microbenchmarks and stress tests can be useful in moderation.

Second, you need a way to run the workloads, which will also dictate the metrics used.
Rust's built-in benchmark tests are a simple starting point, but they use unstable features and therefore only work on nightly Rust. The `bencher` crate is similar but works on stable Rust. Criterion is a more sophisticated alternative. Custom benchmarking harnesses are also possible; for example, rustc-perf is the harness used to benchmark the Rust compiler.
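For illustration, here is a minimal sketch of a Criterion benchmark. The `fib` function and the benchmark name are invented for the example; the file would live at `benches/bench.rs`, with `criterion` declared under `[dev-dependencies]` and `harness = false` set for the bench target.

use criterion::{criterion_group, criterion_main, Criterion};

// A deliberately expensive function to measure (illustrative only).
fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn bench_fib(c: &mut Criterion) {
    // black_box stops the compiler from optimizing the call away.
    c.bench_function("fib 20", |b| b.iter(|| fib(std::hint::black_box(20))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);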
When it comes to metrics, there are many choices, and the right one(s) will depend on the nature of the program being benchmarked. For example, metrics that make sense for a batch program might not make sense for an interactive program. Wall-time is an obvious choice in many cases because it corresponds to what users perceive. However, it can suffer from high variance; in particular, tiny changes in memory layout can cause significant but ephemeral performance fluctuations. Therefore, other metrics with lower variance (such as cycle or instruction counts) may be a reasonable alternative.

Summarizing measurements across multiple workloads is also a challenge; there are a variety of ways to do this, and no single method is obviously best.

Finally, good benchmarking is hard. Having said that, do not stress too much about having a perfect benchmarking setup, particularly when you start optimizing a program. A mediocre setup is much better than none. Keep an open mind about what you are measuring, and over time you can make benchmarking improvements as you learn about your program's performance characteristics.
Build Configuration
The right build configuration can improve the performance of a Rust program without any changes to its code.
Release Builds

The single most important piece of Rust performance advice is simple but easy to overlook: make sure you are using a release build rather than a debug build when you want high performance. This is usually done by specifying the `--release` flag to Cargo.

Release builds typically run much faster than debug builds: 10-100x speedups over debug builds are common.

Debug builds are the default: they are what you get if you run `cargo build`, `cargo run`, or `rustc` without any additional options. Debug builds are good for debugging, but are not optimized.
Consider the final line of output from a `cargo build` run:
Finished dev [unoptimized + debuginfo] target(s) in 29.80s
The `[unoptimized + debuginfo]` indicates that a debug build has been produced. The compiled code will be placed in the `target/debug/` directory. `cargo run` will run a debug build.
Release builds are more optimized than debug builds. They also omit some checks, such as debug assertions and integer overflow checks. Produce one with `cargo build --release`, `cargo run --release`, or `rustc -O`. (Alternatively, `rustc` has multiple other options for optimized builds, such as `-C opt-level`.) A release build will typically take longer than a debug build because of the additional optimizations.
Consider the final line of output from a `cargo build --release` run:
Finished release [optimized] target(s) in 1m 01s
The `[optimized]` indicates that a release build has been produced. The compiled code will be placed in the `target/release/` directory. `cargo run --release` will run a release build.

See the Cargo profile documentation for more details about the differences between debug builds (which use the `dev` profile) and release builds (which use the `release` profile).
Link-Time Optimization

Link-time optimization (LTO) is a whole-program optimization technique that can improve runtime performance by 10-20% or more, at the cost of increased build times. For any individual Rust program, it is usually worth trading longer compile times for higher runtime performance.

The simplest way to enable LTO is to add the following lines to your `Cargo.toml` file and do a release build:
[profile.release]
lto = true
This will cause "fat" LTO, which optimizes across all the crates in the dependency graph.

Alternatively, using `lto = "thin"` in `Cargo.toml` enables "thin" LTO, a less aggressive form of LTO that is usually as effective as fat LTO without increasing build times as much.

See the Cargo LTO documentation for more details about the `lto` setting, and about enabling specific settings for different profiles.
Codegen Units
The Rust compiler splits crates into multiple codegen units to parallelize (and therefore speed up) compilation. However, this can cause it to miss some potential optimizations. If you want to improve runtime performance at the cost of longer compile times, you can set the number of units to one:
[profile.release]
codegen-units = 1
Example.

Note that the codegen unit count is a heuristic, so a smaller count can actually result in a slower program.
Using CPU-Specific Instructions

If you do not mind your binary being incompatible with older (or other kinds of) processors, you can tell the compiler to generate the newest (and potentially fastest) instructions specific to a certain CPU architecture.

For example, if you pass `-C target-cpu=native` to rustc, it will use the best instructions for your current CPU:
RUSTFLAGS="-C target-cpu=native" cargo build --release
This can have a large effect, especially if the compiler finds vectorization opportunities in your code.

As of July 2022, there is an issue on M1 Macs where `-C target-cpu=native` does not detect all the CPU features. You can use `-C target-cpu=apple-m1` instead.

If you are unsure whether `-C target-cpu=native` is working optimally, compare the output of `rustc --print cfg` and `rustc --print cfg -C target-cpu=native` to see whether the CPU features are detected correctly in the latter case. If not, you can use `-C target-feature` to specify particular features.
Abort on panic!

If you do not need to catch or unwind panics, you can tell the compiler to simply abort on panic. This can reduce binary size and increase performance slightly:
[profile.release]
panic = "abort"
Profile-guided Optimization
Profile-guided optimization (PGO) is a compilation model where you compile your program, run it on sample data while collecting profiling data, and then use that profiling data to guide the next compilation of the program. Example.

This is an advanced technique that takes some effort to set up, but it is worthwhile in some cases. See the rustc PGO documentation for details.
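As a rough sketch of the workflow (the paths and program name are illustrative; see the rustc PGO documentation for the authoritative steps):

# Step 1: build with instrumentation.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
# Step 2: run the instrumented binary on representative workloads.
./target/release/myprog --typical-workload
# Step 3: merge the raw profiles (llvm-profdata ships with LLVM).
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
# Step 4: recompile, using the profile data to guide optimization.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release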
Linting
clippy is a collection of lints to catch common mistakes in Rust code. It is an excellent tool to run on Rust code in general, and it can also help with performance, because a number of the lints relate to code patterns that can cause sub-optimal performance.

Basics

Once installed, it is easy to run:
cargo clippy
The full list of performance lints can be seen by visiting the lint list and deselecting all lint groups except for "Perf".

As well as making the code faster, the performance lint suggestions usually result in code that is simpler and more idiomatic, so they are worth following even for code that is not executed frequently.
Disallowing Types
As the following sections will show, it is sometimes worth avoiding certain standard library types in favor of faster alternatives. If you decide to use the alternatives, it is easy to accidentally use the standard library types in some places by mistake.

You can use clippy's `disallowed_types` lint (added in Rust 1.55) to avoid this problem. For example, to disallow the use of the standard hash tables (for reasons explained in the Hashing section), add a `clippy.toml` file to your project with the following line:
disallowed-types = ["std::collections::HashMap", "std::collections::HashSet"]
Then add the following declaration to your Rust code:

#![warn(clippy::disallowed_types)]

This is needed because `disallowed_types` is currently a "nursery" (under development) lint.
Profiling

When optimizing a program, you need a way to determine which parts of the program are "hot" (executed frequently enough to affect the runtime) and worth modifying. This is best done via profiling.

Profilers

There are many different profilers available, each with their strengths and weaknesses. The following is an incomplete list of profilers that have been used successfully on Rust programs.
- perf is a general-purpose profiler that uses hardware performance counters. Hotspot and Firefox Profiler are good for viewing the data recorded by perf. It works on Linux.
- Instruments is a general-purpose profiler that comes with Xcode on macOS.
- AMD μProf is a general-purpose profiler. It works on Windows and Linux.
- flamegraph is a Cargo command that uses perf/DTrace to profile your code and then displays the results in a flame graph. It works on Linux and all platforms that support DTrace (macOS, FreeBSD, NetBSD, and possibly Windows).
- Cachegrind and Callgrind give global, per-function, and per-source-line instruction counts, as well as simulated cache and branch prediction data. They work on Linux and some other Unixes.
- DHAT is good for finding which parts of the code cause a lot of allocations, and for giving insight into peak memory usage. It can also identify hot calls to memcpy. dhat-rs is an experimental alternative that is a little less powerful and requires minor changes to your Rust program, but works on all platforms.
- heaptrack and bytehound are heap profiling tools. They work on Linux.
- counts supports ad hoc profiling, which combines the use of `eprintln!` statements with frequency-based post-processing. This is good for getting domain-specific insights into parts of your code. It works on all platforms.
- Coz performs causal profiling to measure optimization potential, and has Rust support via coz-rs. It works on Linux.
Debug Info

To profile a release build effectively you might need to enable source line debug info. To do this, add the following lines to your `Cargo.toml` file:
[profile.release]
debug = 1
See the Cargo documentation for more details about the `debug` setting.

However, even after doing the above step you won't get detailed profiling information for standard library code, because shipped versions of the Rust standard library are not built with debug info. To fix this, you can build your own version of the compiler and standard library, following these instructions, and adding the following lines to its `config.toml`:
[rust]
debuginfo-level = 1
This is some hassle, but may be worth the effort in some cases.

Symbol Demangling

Rust uses a mangling scheme to encode function names in compiled code. If a profiler is unaware of this scheme, its output may contain symbol names beginning with `_ZN` or `_R`, such as `_ZN3foo3barE`, `_ZN28_$u7b$$u7b$closure$u7d$$u7d$E`, or `_RMCsno73SFvQKx_1cINtB0_3StrKRe616263_E`. Names like these can be manually demangled using `rustfilt`.
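For example, assuming `rustfilt` is installed (e.g. via `cargo install rustfilt`), you can pipe symbol names or whole profiler output through it:

$ echo '_ZN3foo3barE' | rustfilt
foo::bar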
Inlining

Entry to and exit from hot, uninlined functions often accounts for a non-trivial fraction of execution time. Inlining these functions can provide small but easy speed wins.

There are four inline attributes that can be used on Rust functions:
- None. The compiler will decide itself whether the function should be inlined, depending on factors such as the optimization level and the size of the function. If you are not using link-time optimization, functions are never inlined across crates.
- `#[inline]`. This suggests that the function should be inlined, including across crate boundaries.
- `#[inline(always)]`. This strongly suggests that the function should be inlined, including across crate boundaries.
- `#[inline(never)]`. This strongly suggests that the function should not be inlined.

Inline attributes do not guarantee that a function is inlined or not inlined, but in practice `#[inline(always)]` will cause inlining in all but the most exceptional cases.
Simple Cases
The best candidates for inlining are (a) functions that are very small, or (b) functions that have a single call site. The compiler will often inline these functions itself even without an inline attribute. But the compiler cannot always make the best choices, so attributes are sometimes needed. Example 1, Example 2, Example 3, Example 4, Example 5.
Cachegrind is a good profiler for determining if a function is inlined. When looking at Cachegrind’s output, you can tell that a function has been inlined if (and only if) its first and last lines are not marked with event counts. For example:
. #[inline(always)]
. fn inlined(x: u32, y: u32) -> u32 {
700,000 eprintln!("inlined: {} + {}", x, y);
200,000 x + y
. }
.
. #[inline(never)]
400,000 fn not_inlined(x: u32, y: u32) -> u32 {
700,000 eprintln!("not_inlined: {} + {}", x, y);
200,000 x + y
200,000 }
You should measure again after adding inline attributes, because the effects can be unpredictable. Sometimes it has no effect because a nearby function that was previously inlined no longer is. Sometimes it slows the code down. Inlining can also affect compile times, especially cross-crate inlining which involves duplicating internal representations of the functions.
Harder Cases
Sometimes you have a function that is large and has multiple call sites, but only one call site is hot. You would like to inline the hot call site for speed, but not inline the cold call sites to avoid unnecessary code bloat. The way to handle this is to split the function into always-inlined and never-inlined variants, with the latter calling the former.
For example, this function:
fn one() {}
fn two() {}
fn three() {}

fn my_function() {
    one();
    two();
    three();
}
Would become these two functions:
fn one() {}
fn two() {}
fn three() {}

// Use this at the hot call site.
#[inline(always)]
fn inlined_my_function() {
    one();
    two();
    three();
}

// Use this at the cold call sites.
#[inline(never)]
fn uninlined_my_function() {
    inlined_my_function();
}
Hashing

`HashSet` and `HashMap` are two widely-used types. The default hashing algorithm is not specified, but at the time of writing Rust uses SipHash 1-3. This algorithm is high quality (it provides high protection against collisions) but is relatively slow, particularly for short keys such as integers.

If profiling shows that hashing is hot, and HashDoS attacks are not a concern for your application, the use of hash tables with faster hash algorithms can provide large speed wins.
- `rustc-hash` provides `FxHashSet` and `FxHashMap` types that are drop-in replacements for `HashSet` and `HashMap`. Its hashing algorithm is low-quality but very fast, especially for integer keys, and has been found to out-perform all other hash algorithms within rustc. (`fxhash` is an older, less well maintained implementation of the same algorithm and types.)
- `fnv` provides `FnvHashSet` and `FnvHashMap` types. Its hashing algorithm is higher quality than `rustc-hash`'s but a little slower.
- `ahash` provides `AHashSet` and `AHashMap`. Its hashing algorithm can take advantage of AES instruction support that is available on some processors.
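As a small sketch of how the drop-in replacement works (the counting code is invented for the example):

use rustc_hash::FxHashMap;

fn main() {
    // Same API as std::collections::HashMap, but with a faster, lower-quality hasher.
    let mut counts: FxHashMap<u32, u32> = FxHashMap::default();
    for key in [1, 2, 2, 3] {
        *counts.entry(key).or_insert(0) += 1;
    }
    assert_eq!(counts[&2], 2);
}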
If hashing performance is important in your program, it is worth trying more than one of these alternatives. For example, the following results were seen in rustc.
- The switch from `fnv` to `fxhash` gave speedups of up to 6%.
- An attempt to switch from `fxhash` to `ahash` resulted in slowdowns of 1-4%.
- An attempt to switch from `fxhash` back to the default hasher resulted in slowdowns ranging from 4-84%!
If you decide to universally use one of the alternatives, such as `FxHashSet`/`FxHashMap`, it is easy to accidentally use `HashSet`/`HashMap` in some places. You can use clippy to avoid this problem.

Some types don't need hashing. For example, you might have a newtype that wraps an integer, where the integer values are random or close to random. For such a type, the distribution of the hashed values will not be that different from the distribution of the values themselves. In this case the `nohash_hasher` crate can be useful.

Hash function design is a complex topic and is beyond the scope of this book. The `ahash` documentation has a good discussion.
Heap Allocations

Heap allocations are not always expensive; the details depend on which allocator is in use. But each allocation (and deallocation) typically involves acquiring a global lock, doing some non-trivial data structure manipulation, and possibly executing a system call, and small allocations are not necessarily cheaper than large ones. It is worth knowing which Rust data structures and operations cause allocations, because avoiding them can greatly improve performance.

The Rust Container Cheat Sheet has visualizations of common Rust types, and is an excellent complement to the following sections.

Profiling

If a general-purpose profiler shows `malloc`, `free`, and related functions as hot, then it is likely worth trying to reduce the allocation rate and/or using an alternative allocator.
DHAT is an excellent profiler to use when reducing allocation rates. It works on Linux and some other Unixes. It precisely identifies hot allocation sites and their allocation rates. Exact results will vary, but experience with rustc has shown that reducing allocation rates by 10 allocations per million instructions executed can have measurable performance improvements (e.g. ~1%).
Here is some example output from DHAT.
AP 1.1/25 (2 children) {
Total: 54,533,440 bytes (4.02%, 2,714.28/Minstr) in 458,839 blocks (7.72%, 22.84/Minstr), avg size 118.85 bytes, avg lifetime 1,127,259,403.64 instrs (5.61% of program duration)
At t-gmax: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
Reads: 15,993,012 bytes (0.29%, 796.02/Minstr), 0.29/byte
Writes: 20,974,752 bytes (1.03%, 1,043.97/Minstr), 0.38/byte
Allocated at {
#1: 0x95CACC9: alloc (alloc.rs:72)
#2: 0x95CACC9: alloc (alloc.rs:148)
#3: 0x95CACC9: reserve_internal<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:669)
#4: 0x95CACC9: reserve<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:492)
#5: 0x95CACC9: reserve<syntax::tokenstream::TokenStream> (vec.rs:460)
#6: 0x95CACC9: push<syntax::tokenstream::TokenStream> (vec.rs:989)
#7: 0x95CACC9: parse_token_trees_until_close_delim (tokentrees.rs:27)
#8: 0x95CACC9: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
}
}
It is beyond the scope of this book to describe everything in this example, but it should be clear that DHAT gives a wealth of information about allocations, such as where and how often they happen, how big they are, how long they live for, and how often they are accessed.
Box

`Box` is the simplest heap-allocated type. A `Box<T>` value is a `T` value that is allocated on the heap.

It is sometimes worth boxing one or more fields in a struct or enum to make a type smaller. (See the Type Sizes chapter for more about this.)

Other than that, `Box` is straightforward and does not offer much scope for optimizations.
Rc/Arc

`Rc`/`Arc` are similar to `Box`, but the value on the heap is accompanied by two reference counts. They allow value sharing, which can be an effective way to reduce memory usage.

However, if used for values that are rarely shared, they can increase allocation rates by heap allocating values that might otherwise not be heap-allocated. Example.

Unlike `Box`, calling `clone` on an `Rc`/`Arc` value does not involve an allocation. Instead, it merely increments a reference count.
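A small sketch of the difference:

use std::rc::Rc;

fn main() {
    let a: Rc<Vec<u32>> = Rc::new(vec![1, 2, 3]); // allocates
    let b = Rc::clone(&a); // no allocation, just a refcount increment
    assert_eq!(Rc::strong_count(&a), 2);
    assert!(Rc::ptr_eq(&a, &b)); // both handles share the same heap value
}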
Vec

`Vec` is a heap-allocated type with a great deal of scope for optimizing the number of allocations, and/or minimizing the amount of wasted space. To do this requires understanding how its elements are stored.

A `Vec` contains three words: a length, a capacity, and a pointer. The pointer will point to heap-allocated memory if the capacity is nonzero and the element size is nonzero; otherwise, it will not point to allocated memory.

Even if the `Vec` itself is not heap-allocated, the elements (if present and nonzero-sized) always will be. If nonzero-sized elements are present, the memory holding those elements may be larger than necessary, providing space for additional future elements. The number of elements present is the length, and the number of elements that could be held without reallocating is the capacity.

When the vector needs to grow beyond its current capacity, the elements will be copied into a larger heap allocation, and the old heap allocation will be freed.
Vec growth

A new, empty `Vec` created by the common means (`vec![]` or `Vec::new` or `Vec::default`) has a length and capacity of zero, and no heap allocation is required. If you repeatedly push individual elements onto the end of the `Vec`, it will periodically reallocate. The growth strategy is not specified, but at the time of writing it uses a quasi-doubling strategy resulting in the following capacities: 0, 4, 8, 16, 32, 64, and so on. (It skips directly from 0 to 4, instead of going via 1 and 2, because this avoids many allocations in practice.) As a vector grows, the frequency of reallocations will decrease exponentially, but the amount of possibly-wasted excess capacity will increase exponentially.

This growth strategy is typical for growable data structures and reasonable in the general case, but if you know in advance the likely length of a vector you can often do better. If you have a hot vector allocation site (e.g. a hot `Vec::push` call), it is worth using `eprintln!` to print the vector length at that site and then doing some post-processing (e.g. with `counts`) to determine the length distribution. For example, you might have many short vectors, or you might have a smaller number of very long vectors, and the best way to optimize the allocation site will vary accordingly.
Short Vecs

If you have many short vectors, you can use the `SmallVec` type from the `smallvec` crate. `SmallVec<[T; N]>` is a drop-in replacement for `Vec` that can store `N` elements within the `SmallVec` itself, and then switches to a heap allocation if the number of elements exceeds that. (Note also that `vec![]` literals must be replaced with `smallvec![]` literals.)
Example 1, Example 2.
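A minimal sketch of the spill behavior (the element counts are arbitrary):

use smallvec::{smallvec, SmallVec};

fn main() {
    // Up to four elements are stored inline, with no heap allocation.
    let mut v: SmallVec<[u32; 4]> = smallvec![1, 2, 3, 4];
    assert!(!v.spilled());

    // The fifth element forces a switch to a heap allocation.
    v.push(5);
    assert!(v.spilled());
}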
`SmallVec` reliably reduces the allocation rate when used appropriately, but its use does not guarantee improved performance. It is slightly slower than `Vec` for normal operations because it must always check whether the elements are heap-allocated or not. Also, if `N` is high or `T` is large, then the `SmallVec<[T; N]>` itself can be larger than `Vec<T>`, and copying of `SmallVec` values will be slower. As always, benchmarking is required to confirm that an optimization is effective.

If you have many short vectors and you precisely know their maximum length, `ArrayVec` from the `arrayvec` crate is a better choice than `SmallVec`. It does not require the fallback to heap allocation, which makes it a little faster.
Example.
Longer Vecs

If you know the minimum or exact size of a vector, you can reserve a specific capacity with `Vec::with_capacity`, `Vec::reserve`, or `Vec::reserve_exact`. For example, if you know a vector will grow to have at least 20 elements, these functions can immediately provide a vector with a capacity of at least 20 using a single allocation, whereas pushing the items one at a time would result in four allocations (for capacities of 4, 8, 16, and 32).
Example.
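A minimal sketch of the difference:

fn main() {
    // One allocation up front, rather than several as the vector grows.
    let mut v: Vec<u32> = Vec::with_capacity(20);
    let cap = v.capacity();
    for i in 0..20 {
        v.push(i); // never reallocates
    }
    assert_eq!(v.capacity(), cap); // the capacity was never changed by a push
}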
If you know the maximum length of a vector, the above functions also let you not allocate excess space unnecessarily. Similarly, `Vec::shrink_to_fit` can be used to minimize wasted space, but note that it may cause a reallocation.
String

A `String` contains heap-allocated bytes. The representation and operation of `String` are very similar to those of `Vec<u8>`. Many `Vec` methods relating to growth and capacity have equivalents for `String`, such as `String::with_capacity`.

The `SmallString` type from the `smallstr` crate is similar to the `SmallVec` type.

The `String` type from the `smartstring` crate is a drop-in replacement for `String` that avoids heap allocations for strings with less than three words' worth of characters. On 64-bit platforms, this is any string that is less than 24 bytes, which includes all strings containing 23 or fewer ASCII characters.
Example.

Note that the `format!` macro produces a `String`, which means it performs an allocation. If you can avoid a `format!` call by using a string literal, that will avoid this allocation.
Example.

`std::format_args` and/or the `lazy_format` crate may help with this.
Hash tables

`HashSet` and `HashMap` are hash tables. In terms of allocations, their representation and operations are similar to those of `Vec`: they have a single contiguous heap allocation, holding keys and values, which is reallocated as necessary as the table grows. Many `Vec` methods relating to growth and capacity have equivalents for `HashSet`/`HashMap`, such as `HashSet::with_capacity`.
Cow

Sometimes you have some borrowed data, such as a `&str`, that is mostly read-only but occasionally needs to be modified. Cloning the data every time would be wasteful. Instead you can use "clone-on-write" semantics via the `Cow` type, which can represent both borrowed and owned data.

Typically, when starting with a borrowed value `x` you wrap it in a `Cow` with `Cow::Borrowed(x)`. Because `Cow` implements `Deref`, you can call non-mutating methods directly on the data it encloses. If mutation is desired, `Cow::to_mut` will obtain a mutable reference to an owned value, cloning if necessary.

`Cow` can be fiddly to get working, but it is often worth the effort.
Example 1, Example 2, Example 3, Example 4.
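A small sketch of the pattern (the function and the prefix are invented for the example):

use std::borrow::Cow;

// Return the input unchanged where possible, allocating only when needed.
fn ensure_prefix(s: &str) -> Cow<'_, str> {
    if s.starts_with("rs_") {
        Cow::Borrowed(s) // common case: no allocation
    } else {
        Cow::Owned(format!("rs_{}", s)) // rare case: allocate an owned String
    }
}

fn main() {
    assert!(matches!(ensure_prefix("rs_ok"), Cow::Borrowed(_)));
    assert!(matches!(ensure_prefix("fix_me"), Cow::Owned(_)));
}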
clone

Calling `clone` on a value that contains heap-allocated memory typically involves additional allocations. For example, calling `clone` on a non-empty `Vec` requires a new allocation for the elements (but note that the capacity of the new `Vec` might not be the same as the capacity of the original `Vec`). The exception is `Rc`/`Arc`, where a `clone` call just increments the reference count.

`clone_from` is an alternative to `clone`. `a.clone_from(&b)` is equivalent to `a = b.clone()` but may avoid unnecessary allocations. For example, if you want to clone one `Vec` over the top of an existing `Vec`, the existing `Vec`'s heap allocation will be reused if possible, as the following example shows.

let mut v1: Vec<u32> = Vec::with_capacity(99);
let v2: Vec<u32> = vec![1, 2, 3];
v1.clone_from(&v2); // v1's allocation is reused
assert_eq!(v1.capacity(), 99);

Although `clone` usually causes allocations, it is a reasonable thing to use in many circumstances and can often make code simpler. Use profiling data to see which `clone` calls are hot and worth taking the effort to avoid.

Sometimes Rust code ends up containing unnecessary `clone` calls, due to (a) programmer error, or (b) changes in the code that render previously-necessary `clone` calls unnecessary. If you see a hot `clone` call that does not seem necessary, sometimes it can simply be removed.
Example 1, Example 2, Example 3.
to_owned

`ToOwned::to_owned` is implemented for many common types. It creates owned data from borrowed data, usually by cloning, and therefore often causes heap allocations. For example, it can be used to create a `String` from a `&str`.

Sometimes `to_owned` calls can be avoided by storing a reference to borrowed data in a struct rather than an owned copy. This requires lifetime annotations on the struct, complicating the code, and should only be done when profiling and benchmarking shows that it is worthwhile.
Example.
Reusing Collections

Sometimes you need to build up a collection such as a `Vec` in stages. It is usually better to do this by modifying a single `Vec` than by building multiple `Vec`s and then combining them.

For example, if you have a function `do_stuff` that produces a `Vec` that might be called multiple times:

fn do_stuff(x: u32, y: u32) -> Vec<u32> {
    vec![x, y]
}

It might be better to instead modify a passed-in `Vec`:

fn do_stuff(x: u32, y: u32, vec: &mut Vec<u32>) {
    vec.push(x);
    vec.push(y);
}
Sometimes it is worth keeping around a "workhorse" collection that can be reused. For example, if a `Vec` is needed for each iteration of a loop, you could declare the `Vec` outside the loop, use it within the loop body, and then call `clear` at the end of the loop body (to empty the `Vec` without affecting its capacity). This avoids allocations at the cost of obscuring the fact that each iteration's usage of the `Vec` is unrelated to the others.
Example 1, Example 2.
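A minimal sketch of the pattern:

fn main() {
    let mut buf: Vec<u32> = Vec::new(); // the "workhorse" vector

    for n in 1..=5u32 {
        buf.clear(); // empties the Vec but keeps its capacity
        buf.extend(0..n);
        // ... use `buf` for this iteration ...
        assert_eq!(buf.len(), n as usize);
    }
}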
Similarly, it is sometimes worth keeping a “workhorse” collection within a struct, to be reused in one or more methods that are called repeatedly.
Using an Alternative Allocator
Another option for improving the performance of allocation-heavy Rust programs is to replace the default (system) allocator with an alternative allocator. The exact effect will depend on the individual program and the alternative allocator chosen, but large improvements in speed and very large reductions in memory usage have been seen in practice. The effect will also vary across platforms, because each platform’s system allocator has its own strengths and weaknesses. The use of an alternative allocator can also affect binary size.
One popular alternative allocator for Linux and Mac is jemalloc, usable via the `tikv-jemallocator` crate. To use it, add a dependency to your `Cargo.toml` file:
[dependencies]
tikv-jemallocator = "0.4.0"
Then add the following somewhere in your Rust code:
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
`tikv-jemallocator` is a fork of the `jemallocator` crate. As of December 2021, `tikv-jemallocator` uses a jemalloc version that is newer and has better performance than the jemalloc version used by `jemallocator`.

Another alternative allocator that works on many platforms is mimalloc, usable via the `mimalloc` crate.
Avoiding Regressions
To ensure the number and/or size of allocations done by your code doesn’t increase unintentionally, you can use the heap usage testing feature of dhat-rs to write tests that check particular code snippets allocate the expected amount of heap memory.
Type Sizes
Shrinking oft-instantiated types can help performance.
For example, if memory usage is high, a heap profiler like DHAT can identify the hot allocation points and the types involved. Shrinking these types can reduce peak memory usage, and possibly improve performance by reducing memory traffic and cache pressure.
Furthermore, Rust types that are larger than 128 bytes are copied with `memcpy` rather than inline code. If `memcpy` shows up in non-trivial amounts in profiles, DHAT's "copy profiling" mode will tell you exactly where the hot `memcpy` calls are and the types involved. Shrinking these types to 128 bytes or less can make the code faster by avoiding `memcpy` calls and reducing memory traffic.
Measuring Type Sizes

`std::mem::size_of` gives the size of a type, in bytes, but often you want to know the exact layout as well. For example, an enum might be surprisingly large due to a single outsized variant.

The `-Zprint-type-sizes` option does exactly this. It isn't enabled on release versions of rustc, so you'll need to use a nightly version of rustc. Here is one possible invocation via Cargo:
RUSTFLAGS=-Zprint-type-sizes cargo +nightly build --release
And here is a possible invocation of rustc:
rustc +nightly -Zprint-type-sizes input.rs
It will print out details of the size, layout, and alignment of all types in use. For example, for this type:
enum E {
    A,
    B(i32),
    C(u64, u8, u64, u8),
    D(Vec<u32>),
}
it prints the following, plus information about a few built-in types.
print-type-size type: `E`: 32 bytes, alignment: 8 bytes
print-type-size discriminant: 1 bytes
print-type-size variant `D`: 31 bytes
print-type-size padding: 7 bytes
print-type-size field `.0`: 24 bytes, alignment: 8 bytes
print-type-size variant `C`: 23 bytes
print-type-size field `.1`: 1 bytes
print-type-size field `.3`: 1 bytes
print-type-size padding: 5 bytes
print-type-size field `.0`: 8 bytes, alignment: 8 bytes
print-type-size field `.2`: 8 bytes
print-type-size variant `B`: 7 bytes
print-type-size padding: 3 bytes
print-type-size field `.0`: 4 bytes, alignment: 4 bytes
print-type-size variant `A`: 0 bytes
The output shows the following.
- The size and alignment of the type.
- For enums, the size of the discriminant.
- For enums, the size of each variant (sorted from largest to smallest).
- The size, alignment, and ordering of all fields. (Note that the compiler has reordered variant `C`'s fields to minimize the size of `E`.)
- The size and location of all padding.
Once you know the layout of a hot type, there are multiple ways to shrink it.
Field Ordering

The Rust compiler automatically sorts the fields in structs and enums to minimize their sizes (unless the `#[repr(C)]` attribute is specified), so you do not have to worry about field ordering. But there are other ways to minimize the size of hot types.
Smaller Enums
If an enum has an outsized variant, consider boxing one or more fields. For example, you could change this type:
type LargeType = [u8; 100];
enum A {
    X,
    Y(i32),
    Z(i32, LargeType),
}
to this:
type LargeType = [u8; 100];
enum A {
    X,
    Y(i32),
    Z(Box<(i32, LargeType)>),
}
This reduces the type size at the cost of requiring an extra heap allocation for the `A::Z` variant. This is more likely to be a net performance win if the `A::Z` variant is relatively rare. The `Box` will also make `A::Z` slightly less ergonomic to use, especially in `match` patterns.
Example 1, Example 2, Example 3, Example 4, Example 5, Example 6.
Smaller Integers

It is often possible to shrink types by using smaller integer types. For example, while it is most natural to use `usize` for indices, it is often reasonable to store indices as `u32`, `u16`, or even `u8`, and then coerce to `usize` at use points.
Example 1, Example 2.
Boxed Slices

Rust vectors contain three words: a length, a capacity, and a pointer. If you have a vector that is unlikely to be changed in the future, you can convert it to a boxed slice with `Vec::into_boxed_slice`. A boxed slice contains only two words, a length and a pointer. Any excess element capacity is dropped, which may cause a reallocation.

use std::mem::{size_of, size_of_val};

let v: Vec<u32> = vec![1, 2, 3];
assert_eq!(size_of_val(&v), 3 * size_of::<usize>());

let bs: Box<[u32]> = v.into_boxed_slice();
assert_eq!(size_of_val(&bs), 2 * size_of::<usize>());

The boxed slice can be converted back to a vector with `slice::into_vec` without any cloning or a reallocation.
ThinVec

An alternative to boxed slices is `ThinVec`, from the `thin_vec` crate. It is functionally equivalent to `Vec`, but stores the length and capacity in the same allocation as the elements (if there are any). This means that `size_of::<ThinVec<T>>` is only one word.

`ThinVec` is a good choice within oft-instantiated types for vectors that are often empty. It can also be used to shrink the largest variant of an enum, if that variant contains a `Vec`.
Avoiding Regressions

If a type is hot enough that its size can affect performance, it is a good idea to use a static assertion to ensure that it does not accidentally regress. The following example uses a macro from the `static_assertions` crate.
// This type is used a lot. Make sure it doesn't unintentionally get bigger.
#[cfg(target_arch = "x86_64")]
static_assertions::assert_eq_size!(HotType, [u8; 64]);
The `cfg` attribute is important, because type sizes can vary on different platforms. Restricting the assertion to `x86_64` (which is typically the most widely-used platform) is likely to be good enough to prevent regressions in practice.
Standard Library Types

It is worth reading through the documentation for common standard library types, such as `Box`, `Vec`, `Option`, `Result`, and `Rc`/`Arc`, to find interesting functions that can sometimes be used to improve performance.

It is also worth knowing about high-performance alternatives to standard library types such as `Mutex`, `RwLock`, `Condvar`, and `Once`.
Box

The expression `Box::default()` has the same effect as `Box::new(T::default())` but can be faster because the compiler can create the value directly on the heap, rather than constructing it on the stack and then copying it over.
Example.
Vec

`Vec::remove` removes an element at a particular index and shifts all subsequent elements one to the left, which makes it O(n). `Vec::swap_remove` replaces an element at a particular index with the final element, which does not preserve ordering, but is O(1).

`Vec::retain` efficiently removes multiple items from a `Vec`. There is an equivalent method for other collection types such as `String`, `HashSet`, and `HashMap`.
Option and Result

`Option::ok_or` converts an `Option` into a `Result`, and is passed an `err` parameter that is used if the `Option` value is `None`. `err` is computed eagerly. If its computation is expensive, you should instead use `Option::ok_or_else`, which computes the error value lazily via a closure.
For example, this:
fn expensive() {}
let o: Option<u32> = None;
let r = o.ok_or(expensive()); // always evaluates `expensive()`
should be changed to this:
fn expensive() {}
let o: Option<u32> = None;
let r = o.ok_or_else(|| expensive()); // evaluates `expensive()` only when needed
There are similar alternatives for `Option::map_or`, `Option::unwrap_or`, `Result::or`, `Result::map_or`, and `Result::unwrap_or`.
Rc/Arc

`Rc::make_mut`/`Arc::make_mut` provide clone-on-write semantics. They make a mutable reference to an `Rc`/`Arc`. If the refcount is greater than one, they will `clone` the inner value to ensure unique ownership; otherwise, they will modify the original value. They are not needed often, but they can be extremely useful on occasion.
Example 1, Example 2.
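A small sketch of the clone-on-write behavior:

use std::rc::Rc;

fn main() {
    let mut a = Rc::new(vec![1, 2]);
    let b = Rc::clone(&a);

    // The refcount is two, so make_mut clones the Vec before mutating it.
    Rc::make_mut(&mut a).push(3);
    assert_eq!(*a, vec![1, 2, 3]);
    assert_eq!(*b, vec![1, 2]); // `b` still sees the original value
}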
Mutex, RwLock, Condvar, and Once

The `parking_lot` crate provides alternative implementations of these synchronization types. The APIs and semantics of the `parking_lot` types are similar but not identical to those of the equivalent types in the standard library.

The `parking_lot` versions used to be reliably smaller, faster, and more flexible than those in the standard library, but the standard library versions have greatly improved on some platforms. So you should measure before switching to `parking_lot`.

If you decide to universally use the `parking_lot` types it is easy to accidentally use the standard library equivalents in some places. You can use clippy to avoid this problem.
Iterators

collect and extend

`Iterator::collect` converts an iterator into a collection such as `Vec`, which typically requires an allocation. You should avoid calling `collect` if the collection is then only iterated over again.

For this reason, it is often better to return an iterator type like `impl Iterator<Item=T>` from a function than a `Vec<T>`. Note that sometimes additional lifetimes are required on these return types, as this post explains.
Example.
Similarly, you can use `extend` to extend an existing collection (such as a `Vec`) with an iterator, rather than collecting the iterator into a `Vec` and then using `append`.
Finally, when you write an iterator it is often worth implementing the `Iterator::size_hint` or `ExactSizeIterator::len` method, if possible. `collect` and `extend` calls that use the iterator may then do fewer allocations, because they have advance information about the number of elements yielded by the iterator.
Chaining

`chain` can be very convenient, but it can also be slower than a single iterator. It may be worth avoiding for hot iterators, if possible.
Example.

Similarly, `filter_map` may be faster than using `filter` followed by `map`.
I/O
Locking

Rust's `print!` and `println!` macros lock stdout on every call. If you have repeated calls to these macros it may be better to lock stdout manually.

For example, change this code:
let lines = vec!["one", "two", "three"];
for line in lines {
    println!("{}", line);
}
to this:
fn blah() -> Result<(), std::io::Error> {
    let lines = vec!["one", "two", "three"];
    use std::io::Write;
    let mut stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for line in lines {
        writeln!(lock, "{}", line)?;
    }
    // stdout is unlocked when `lock` is dropped
    Ok(())
}
stdin and stderr can likewise be locked when doing repeated operations on them.
Buffering

Rust file I/O is unbuffered by default. If you have many small and repeated read or write calls to a file or network socket, use `BufReader` or `BufWriter`. They maintain an in-memory buffer for input and output, minimizing the number of system calls required.

For example, change this unbuffered output code:
fn blah() -> Result<(), std::io::Error> {
    let lines = vec!["one", "two", "three"];
    use std::io::Write;
    let mut out = std::fs::File::create("test.txt")?;
    for line in lines {
        writeln!(out, "{}", line)?;
    }
    Ok(())
}
to this:
fn blah() -> Result<(), std::io::Error> {
    let lines = vec!["one", "two", "three"];
    use std::io::{BufWriter, Write};
    let out = std::fs::File::create("test.txt")?;
    let mut buf = BufWriter::new(out);
    for line in lines {
        writeln!(buf, "{}", line)?;
    }
    buf.flush()?;
    Ok(())
}
The explicit call to `flush` is not strictly necessary, as flushing will happen automatically when `buf` is dropped. However, in that case any error that occurs on flushing will be ignored, whereas an explicit flush will make that error explicit.
Note that buffering also works with stdout, so you might want to combine manual locking and buffering when making many writes to stdout.
Reading Input as Raw Bytes

The built-in `String` type uses UTF-8 internally, which adds a small but nonzero overhead caused by UTF-8 validation when you read input into it. If you just want to process input bytes without worrying about UTF-8 (for example if you handle ASCII text), you can use `BufRead::read_until`.
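For example, here is a sketch that counts lines while treating the input as raw bytes (the file path is illustrative):

use std::fs::File;
use std::io::{BufRead, BufReader};

fn count_lines(path: &str) -> std::io::Result<usize> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = Vec::new();
    let mut n = 0;
    // read_until works on raw bytes, so no UTF-8 validation occurs.
    while reader.read_until(b'\n', &mut buf)? > 0 {
        n += 1;
        buf.clear();
    }
    Ok(n)
}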
There are also dedicated crates for reading byte-oriented lines of data and working with byte strings.
Logging and Debugging
Sometimes logging code or debugging code can slow down a program significantly. Either the logging/debugging code itself is slow, or data collection code that feeds into logging/debugging code is slow. Make sure that no unnecessary work is done for logging/debugging purposes when logging/debugging is not enabled. Example 1, Example 2.
Note that `assert!` calls always run, but `debug_assert!` calls only run in debug builds. If you have an assertion that is hot but is not necessary for safety, consider making it a `debug_assert!`.
Example 1, Example 2.
Wrapper Types

Rust has a variety of "wrapper" types, such as `RefCell` and `Mutex`, that provide special behavior for values. Accessing these values can take a non-trivial amount of time. If multiple such values are typically accessed together, it may be better to put them within a single wrapper.
For example, a struct like this:
use std::sync::{Arc, Mutex};

struct S {
    x: Arc<Mutex<u32>>,
    y: Arc<Mutex<u32>>,
}
may be better represented like this:
use std::sync::{Arc, Mutex};

struct S {
    xy: Arc<Mutex<(u32, u32)>>,
}
Whether or not this helps performance will depend on the exact access patterns of the values. Example.
Machine Code
When you have a small piece of very hot code, it may be worth inspecting the generated machine code to see if it has any inefficiencies. The Compiler Explorer website is an excellent resource when doing this.
Relatedly, the `core::arch` module provides access to architecture-specific intrinsics, many of which relate to SIMD instructions.
It is sometimes possible to avoid bounds checking within loops by adding assertions on the ranges of the index variables. This is an advanced technique, and you should check the generated code to ensure the bounds checks are actually removed. Example 1, Example 2.
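For example, here is a sketch of the technique (the function is hypothetical, and whether the checks are actually removed must be confirmed by inspecting the generated code):

fn sum3(xs: &[u32], i: usize) -> u32 {
    // One explicit check up front can let the compiler elide the
    // three per-index bounds checks below.
    assert!(i + 2 < xs.len());
    xs[i] + xs[i + 1] + xs[i + 2]
}

fn main() {
    assert_eq!(sum3(&[1, 2, 3, 4], 1), 9);
}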
Parallelism

Rust provides excellent support for safe parallel programming, which can lead to large performance improvements. There are a variety of ways to introduce parallelism into a program, and the best way for any program depends greatly on its design.

An in-depth treatment of parallelism is beyond the scope of this book. If you are interested in this topic, the documentation for the `rayon` and `crossbeam` crates is a good place to start.
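For example, with `rayon` a sequential iterator chain can often be parallelized by switching to a parallel iterator (a minimal sketch):

use rayon::prelude::*;

fn main() {
    // into_par_iter() distributes the work across a thread pool.
    let total: u64 = (0..1_000_000u64).into_par_iter().map(|x| x * x).sum();
    assert_eq!(total, (0..1_000_000u64).map(|x| x * x).sum::<u64>());
}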
Binary Size

Sometimes you need to minimize the size of a compiled Rust binary. In that case, consult the comprehensive documentation in the min-sized-rust repository.
General Tips

The previous sections of this book have discussed Rust-specific techniques. This section gives a brief overview of some general performance principles.

As long as the obvious pitfalls are avoided (e.g. using non-release builds), Rust programs generally have good performance.

Optimized code is often more complex and takes more effort to write than unoptimized code. For this reason, it is only worth optimizing hot code.

The biggest performance improvements often come from changes to algorithms or data structures, rather than low-level optimizations.
Example 1, Example 2.

Writing code that works well with modern hardware is not always easy, but worth striving for. For example, try to minimize cache misses and branch mispredictions where possible.

Most optimizations result in small speedups. Although no single small speedup is noticeable, they really add up if you can do enough of them.

Different profilers have different strengths. It is good to use more than one.
When profiling indicates that a function is hot, there are two common ways to speed things up: (a) make the function faster, and/or (b) avoid calling it as often.
It is often easier to eliminate silly slowdowns than it is to introduce clever speedups.
Avoid unnecessary computation. Lazy or on-demand evaluation often brings wins.
Example 1, Example 2.
Complex general cases can often be avoided by optimistically checking for common special cases that are simpler. Example 1, Example 2, Example 3.
In particular, specially handling collections with 0, 1, or 2 elements is often a win when small sizes dominate. Example 1, Example 2, Example 3, Example 4.
Similarly, when dealing with repetitive data, it is often possible to use a simple form of data compression, by using a compact representation for common values and then having a fallback to a secondary table for unusual values. Example 1, Example 2, Example 3.
When code deals with multiple cases, measure case frequencies and handle the most common ones first.
When dealing with lookups that involve high locality, it can be a win to put a small cache in front of a data structure.
Optimized code often has a non-obvious structure, which means that explanatory comments are valuable, particularly those that reference profiling measurements. A comment like “99% of the time this vector has 0 or 1 elements, so handle those cases first” can be illuminating.
Compile Times

Although this book is mostly about improving the performance of Rust programs, this section is about reducing the compile times of Rust programs, because that is a related topic of interest to many people.
Linking
A big chunk of compile time is actually linking time, particularly when rebuilding a program after a small change. It is possible to select a faster linker than the default one.

One option is lld, which supports ELF, PE/COFF, Mach-O, wasm, and more.

To specify lld from the command line, precede your build command with:
RUSTFLAGS="-C link-arg=-fuse-ld=lld"
To specify lld from a config.toml file (for one or more projects), add these lines:
[build]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]
lld is not yet fully supported for use with Rust, but it should work for most use cases. There is a GitHub issue tracking full support for lld.

Another option is mold, which currently supports ELF only. It is specified in the same way as lld; simply substitute mold for lld in the instructions above.

mold is often faster than lld, but it is also much newer and may not work in all cases.
Incremental Compilation

The Rust compiler supports incremental compilation, which avoids redoing work when you recompile a crate. It can greatly speed up compilation, at the cost of sometimes making the produced executable run a little more slowly. For this reason, it is only enabled by default for debug builds. If you want to enable it for release builds as well, add the following lines to your `Cargo.toml` file:
[profile.release]
incremental = true
See the Cargo documentation for more details about the `incremental` setting, and about enabling specific settings for different profiles.
Visualization

Cargo has a feature that lets you visualize compilation of your program. Build with this command (Rust 1.60 or later):
cargo build --timings
or this (Rust 1.59 or earlier):
cargo +nightly build -Ztimings
On completion it will print the name of an HTML file. Open that file in a web browser. It contains a Gantt chart that shows the dependencies between the various crates in your program. This shows how much parallelism there is in your crate graph, which can indicate if any large crates that serialize compilation should be broken up. See the documentation for more details on how to read the graphs.
LLVM IR
The Rust compiler uses LLVM for its back-end. LLVM’s execution can be a large part of compile times, especially when the Rust compiler’s front end generates a lot of IR which takes LLVM a long time to optimize.
These problems can be diagnosed with cargo llvm-lines
, which shows which
Rust functions cause the most LLVM IR to be generated. Generic functions are
often the most important ones, because they can be instantiated dozens or even
hundreds of times in large programs.
If a generic function causes IR bloat, there are several ways to fix it. The simplest is to just make the function smaller. Example 1, Example 2.
Another way is to move the non-generic parts of the function into a separate, non-generic function, which will only be instantiated once. Whether or not this is possible will depend on the details of the generic function. The non-generic function can often be written as an inner function within the generic function, to minimize its exposure to the rest of the code. Example.
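For example, here is a sketch of the pattern, modeled on what the standard library does for functions like `std::fs::read_to_string`:

use std::path::Path;

// The non-generic inner function is instantiated only once.
fn print_path_inner(path: &Path) {
    println!("{}", path.display());
}

// The generic shell is tiny, so each instantiation generates little IR.
fn print_path<P: AsRef<Path>>(path: P) {
    print_path_inner(path.as_ref());
}

fn main() {
    print_path("a.txt"); // instantiates print_path::<&str>
    print_path(String::from("b.txt")); // instantiates print_path::<String>
}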
Sometimes common utility functions such as `Option::map` and `Result::map_err` are instantiated many times. Replacing them with equivalent `match` expressions can help compile times.
The effects of these sorts of changes on compile times will usually be small, though occasionally they can be large. Example.