Heap Allocations
Heap allocations are not always expensive; the exact cost depends on which allocator is in use. But each allocation/deallocation typically involves acquiring a global lock, performing some non-trivial data structure manipulation, and possibly even executing a system call. Small allocations are not necessarily cheaper than large ones. It is worth knowing which Rust data structures (and which of their operations) cause heap allocations, because eliminating heap allocations can greatly improve performance.
The Rust Container Cheat Sheet has visualizations of common Rust types, and is an excellent complement to the following sections.
Profiling
If a profiler shows `malloc`, `free`, and related functions as hot, then it is likely worth trying to reduce the allocation rate and/or using an alternative allocator.
DHAT is an excellent profiler to use when reducing allocation rates. It works on Linux and some other Unixes. It precisely identifies hot allocation sites and their allocation rates. Exact results will vary, but experience with rustc has shown that reducing allocation rates by 10 allocations per million instructions executed can have measurable performance improvements (e.g. ~1%).
Here is some example output from DHAT.
```
AP 1.1/25 (2 children) {
  Total:     54,533,440 bytes (4.02%, 2,714.28/Minstr) in 458,839 blocks (7.72%, 22.84/Minstr), avg size 118.85 bytes, avg lifetime 1,127,259,403.64 instrs (5.61% of program duration)
  At t-gmax: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
  At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
  Reads:     15,993,012 bytes (0.29%, 796.02/Minstr), 0.29/byte
  Writes:    20,974,752 bytes (1.03%, 1,043.97/Minstr), 0.38/byte
  Allocated at {
    #1: 0x95CACC9: alloc (alloc.rs:72)
    #2: 0x95CACC9: alloc (alloc.rs:148)
    #3: 0x95CACC9: reserve_internal<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:669)
    #4: 0x95CACC9: reserve<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:492)
    #5: 0x95CACC9: reserve<syntax::tokenstream::TokenStream> (vec.rs:460)
    #6: 0x95CACC9: push<syntax::tokenstream::TokenStream> (vec.rs:989)
    #7: 0x95CACC9: parse_token_trees_until_close_delim (tokentrees.rs:27)
    #8: 0x95CACC9: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
  }
}
```
It is beyond the scope of this book to describe everything in this example, but it should be clear that DHAT gives a wealth of information about allocations, such as where and how often they happen, how big they are, how long they live for, and how often they are accessed.
Box
`Box` is the simplest heap-allocated type. A `Box<T>` value is a `T` value that is allocated on the heap.
It is sometimes worth boxing one or more fields in a struct or enum to make a type smaller. (See the Type Sizes chapter for more about this.)
Other than that, `Box` is straightforward and does not offer much scope for optimizations.
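As a sketch of the field-boxing idea (the enum names here are invented for illustration): an enum is at least as large as its largest variant, so boxing a bulky payload shrinks the enum as a whole, at the cost of a heap allocation whenever that variant is created.

```rust
// A rarely-used large variant inflates every value of the enum.
enum Unboxed {
    Small(u8),
    Large([u8; 1024]),
}

// Boxing the large payload shrinks the enum to roughly pointer size.
enum Boxed {
    Small(u8),
    Large(Box<[u8; 1024]>),
}

fn main() {
    assert!(std::mem::size_of::<Boxed>() < std::mem::size_of::<Unboxed>());
    println!(
        "Unboxed: {} bytes, Boxed: {} bytes",
        std::mem::size_of::<Unboxed>(),
        std::mem::size_of::<Boxed>()
    );
}
```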
Rc/Arc
`Rc`/`Arc` are similar to `Box`, but the value on the heap is accompanied by two reference counts. They allow value sharing, which can be an effective way to reduce memory usage.
However, if used for values that are rarely shared, they can increase allocation rates by heap allocating values that might otherwise not be heap-allocated. Example.
Unlike `Box`, calling `clone` on an `Rc`/`Arc` value does not involve an allocation. Instead, it merely increments a reference count.
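A minimal sketch of this behaviour: cloning an `Rc` bumps the strong count and yields a second handle to the same heap allocation, rather than copying the underlying value.

```rust
use std::rc::Rc;

fn main() {
    let a: Rc<Vec<u32>> = Rc::new(vec![1, 2, 3]);
    // `clone` does not copy the Vec; it only increments the count.
    let b = Rc::clone(&a);
    assert_eq!(Rc::strong_count(&a), 2);
    // Both handles point at the same heap allocation.
    assert!(Rc::ptr_eq(&a, &b));
    drop(b);
    assert_eq!(Rc::strong_count(&a), 1);
}
```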
Vec
`Vec` is a heap-allocated type with a great deal of scope for optimizing the number of allocations, and/or minimizing the amount of wasted space. To do this requires understanding how its elements are stored.
A `Vec` contains three words: a length, a capacity, and a pointer. The pointer will point to heap-allocated memory if the capacity is nonzero and the element size is nonzero; otherwise, it will not point to allocated memory.
Even if the `Vec` itself is not heap-allocated, the elements (if present and nonzero-sized) always will be. If nonzero-sized elements are present, the memory holding those elements may be larger than necessary, providing space for additional future elements. The number of elements present is the length, and the number of elements that could be held without reallocating is the capacity. When the vector needs to grow beyond its current capacity, the elements will be copied into a larger heap allocation, and the old heap allocation will be freed.
Vec growth
A new, empty `Vec` created by the common means (`vec![]` or `Vec::new` or `Vec::default`) has a length and capacity of zero, and no heap allocation is required. If you repeatedly push individual elements onto the end of the `Vec`, it will periodically reallocate. The growth strategy is not specified, but at the time of writing it uses a quasi-doubling strategy resulting in the following capacities: 0, 4, 8, 16, 32, 64, and so on. (It skips directly from 0 to 4, instead of going via 1 and 2, because this avoids many allocations in practice.) As a vector grows, the frequency of reallocations will decrease exponentially, but the amount of possibly-wasted excess capacity will increase exponentially.
This growth strategy is typical for growable data structures and reasonable in the general case, but if you know in advance the likely length of a vector you can often do better. If you have a hot vector allocation site (e.g. a hot `Vec::push` call), it is worth using `eprintln!` to print the vector length at that site and then doing some post-processing (e.g. with `counts`) to determine the length distribution. For example, you might have many short vectors, or you might have a smaller number of very long vectors, and the best way to optimize the allocation site will vary accordingly.
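The growth behaviour can be observed directly by recording each distinct capacity as elements are pushed one at a time. The exact sequence is an implementation detail and may change between releases, so the sketch below only asserts properties that hold for any reasonable strategy.

```rust
fn main() {
    let mut v: Vec<u32> = Vec::new();
    let mut caps = vec![v.capacity()];
    for i in 0..20 {
        v.push(i);
        // Record each capacity change caused by a reallocation.
        if v.capacity() != *caps.last().unwrap() {
            caps.push(v.capacity());
        }
    }
    // At the time of writing this prints something like [0, 4, 8, 16, 32].
    println!("capacities seen: {:?}", caps);
    // Whatever the strategy, capacity only grows and always covers len.
    assert!(caps.windows(2).all(|w| w[0] < w[1]));
    assert!(v.capacity() >= v.len());
}
```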
Short `Vec`s
If you have many short vectors, you can use the `SmallVec` type from the `smallvec` crate. `SmallVec<[T; N]>` is a drop-in replacement for `Vec` that can store `N` elements within the `SmallVec` itself, and then switches to a heap allocation if the number of elements exceeds that. (Note also that `vec![]` literals must be replaced with `smallvec![]` literals.)
Example 1, Example 2.
`SmallVec` reliably reduces the allocation rate when used appropriately, but its use does not guarantee improved performance. It is slightly slower than `Vec` for normal operations because it must always check whether the elements are heap-allocated or not. Also, if `N` is high or `T` is large, then the `SmallVec<[T; N]>` itself can be larger than `Vec<T>`, and copying of `SmallVec` values will be slower. As always, benchmarking is required to confirm that an optimization is effective.
If you have many short vectors and you precisely know their maximum length, `ArrayVec` from the `arrayvec` crate is a better choice than `SmallVec`. It does not require the fallback to heap allocation, which makes it a little faster.
Example.
Longer `Vec`s
If you know the minimum or exact size of a vector, you can reserve a specific capacity with `Vec::with_capacity`, `Vec::reserve`, or `Vec::reserve_exact`. For example, if you know a vector will grow to have at least 20 elements, these functions can immediately provide a vector with a capacity of at least 20 using a single allocation, whereas pushing the items one at a time would result in four allocations (for capacities of 4, 8, 16, and 32).
Example.
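A small sketch of the `Vec::with_capacity` case: because the buffer is allocated once up front, the buffer address does not change while the 20 elements are pushed, confirming that no reallocation occurred.

```rust
fn main() {
    // One up-front allocation replaces the series of reallocations
    // that pushing 20 elements into an empty Vec would cause.
    let mut v: Vec<u32> = Vec::with_capacity(20);
    let start_ptr = v.as_ptr();
    for i in 0..20 {
        v.push(i);
    }
    // The buffer is still at the same address: no reallocation happened.
    assert_eq!(v.as_ptr(), start_ptr);
    assert_eq!(v.len(), 20);
    assert!(v.capacity() >= 20);
}
```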
If you know the maximum length of a vector, the above functions also let you not allocate excess space unnecessarily. Similarly, `Vec::shrink_to_fit` can be used to minimize wasted space, but note that it may cause a reallocation.
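For instance, a vector built with a generous capacity can give back its excess afterwards; the trade-off is that `shrink_to_fit` itself may reallocate and copy the elements.

```rust
fn main() {
    // Build with generous capacity, then return the excess.
    let mut v: Vec<u32> = Vec::with_capacity(1000);
    v.extend(0..10);
    assert!(v.capacity() >= 1000);
    v.shrink_to_fit(); // may reallocate in order to drop the excess
    assert_eq!(v.len(), 10);
    assert!(v.capacity() >= v.len());
    println!("len = {}, capacity = {}", v.len(), v.capacity());
}
```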
String
A `String` contains heap-allocated bytes. The representation and operation of `String` are very similar to those of `Vec<u8>`. Many `Vec` methods relating to growth and capacity have equivalents for `String`, such as `String::with_capacity`.
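A brief sketch of the parallel with `Vec`: reserving capacity up front lets a `String` be built with a single allocation, verified here by checking that the buffer address stays stable.

```rust
fn main() {
    // 11 bytes is enough for "hello world", so one allocation suffices.
    let mut s = String::with_capacity(11);
    let start_ptr = s.as_ptr();
    s.push_str("hello");
    s.push(' ');
    s.push_str("world");
    assert_eq!(s, "hello world");
    // The buffer was never reallocated while building the string.
    assert_eq!(s.as_ptr(), start_ptr);
}
```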
The `SmallString` type from the `smallstr` crate is similar to the `SmallVec` type.
The `String` type from the `smartstring` crate is a drop-in replacement for `String` that avoids heap allocations for strings with less than three words’ worth of characters. On 64-bit platforms, this is any string that is less than 24 bytes, which includes all strings containing 23 or fewer ASCII characters.
Example.
Note that the `format!` macro produces a `String`, which means it performs an allocation. If you can avoid a `format!` call by using a string literal, that will avoid this allocation.
Example.
`std::format_args` and/or the `lazy_format` crate may help with this.
Hash tables
`HashSet` and `HashMap` are hash tables. Their representation and operations are similar to those of `Vec`, in terms of allocations: they have a single contiguous heap allocation, holding keys and values, which is reallocated as necessary as the table grows. Many `Vec` methods relating to growth and capacity have equivalents for `HashSet`/`HashMap`, such as `HashSet::with_capacity`.
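A short sketch of the `with_capacity` equivalent for hash tables: pre-sizing the map avoids the repeated reallocation and rehashing that would otherwise happen as entries are inserted.

```rust
use std::collections::HashMap;

fn main() {
    // Room for 100 entries is reserved up front.
    let mut m: HashMap<u32, u32> = HashMap::with_capacity(100);
    for i in 0..100 {
        m.insert(i, i * i);
    }
    assert!(m.capacity() >= 100);
    assert_eq!(m.len(), 100);
    assert_eq!(m[&7], 49);
}
```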
Cow
Sometimes you have some borrowed data, such as a `&str`, that is mostly read-only but occasionally needs to be modified. Cloning the data every time would be wasteful. Instead you can use “clone-on-write” semantics via the `Cow` type, which can represent both borrowed and owned data.
Typically, when starting with a borrowed value `x` you wrap it in a `Cow` with `Cow::Borrowed(x)`. Because `Cow` implements `Deref`, you can call non-mutating methods directly on the data it encloses. If mutation is desired, `Cow::to_mut` will obtain a mutable reference to an owned value, cloning if necessary.
`Cow` can be fiddly to get working, but it is often worth the effort.
Example 1, Example 2, Example 3, Example 4.
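As a sketch of the pattern (the `remove_spaces` function is invented for illustration): inputs that need no change are returned borrowed, with no allocation, and only inputs that actually require modification pay for an owned `String`.

```rust
use std::borrow::Cow;

// Strip spaces only when necessary. Inputs without spaces come back
// as Cow::Borrowed, so the common case allocates nothing.
fn remove_spaces(s: &str) -> Cow<'_, str> {
    if s.contains(' ') {
        Cow::Owned(s.replace(' ', ""))
    } else {
        Cow::Borrowed(s)
    }
}

fn main() {
    // No allocation: the input is returned borrowed.
    assert!(matches!(remove_spaces("nospaces"), Cow::Borrowed(_)));
    // Allocation only happens when a modification is needed.
    assert_eq!(remove_spaces("a b c"), "abc");
}
```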
clone
Calling `clone` on a value that contains heap-allocated memory typically involves additional allocations. For example, calling `clone` on a non-empty `Vec` requires a new allocation for the elements (but note that the capacity of the new `Vec` might not be the same as the capacity of the original `Vec`). The exception is `Rc`/`Arc`, where a `clone` call just increments the reference count.
`clone_from` is an alternative to `clone`. `a.clone_from(&b)` is equivalent to `a = b.clone()` but may avoid unnecessary allocations. For example, if you want to clone one `Vec` over the top of an existing `Vec`, the existing `Vec`’s heap allocation will be reused if possible, as the following example shows.

```rust
fn main() {
    let mut v1: Vec<u32> = Vec::with_capacity(99);
    let v2: Vec<u32> = vec![1, 2, 3];
    v1.clone_from(&v2); // v1's allocation is reused
    assert_eq!(v1.capacity(), 99);
}
```
Although `clone` usually causes allocations, it is a reasonable thing to use in many circumstances and can often make code simpler. Use profiling data to see which `clone` calls are hot and worth taking the effort to avoid.
Sometimes Rust code ends up containing unnecessary `clone` calls, due to (a) programmer error, or (b) changes in the code that render previously-necessary `clone` calls unnecessary. If you see a hot `clone` call that does not seem necessary, sometimes it can simply be removed.
Example 1, Example 2, Example 3.
to_owned
`ToOwned::to_owned` is implemented for many common types. It creates owned data from borrowed data, usually by cloning, and therefore often causes heap allocations. For example, it can be used to create a `String` from a `&str`.
Sometimes `to_owned` calls can be avoided by storing a reference to borrowed data in a struct rather than an owned copy. This requires lifetime annotations on the struct, complicating the code, and should only be done when profiling and benchmarking show that it is worthwhile.
Example.
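A minimal sketch of the reference-in-a-struct alternative (the `Message` struct is invented for illustration): the borrowed version needs a lifetime annotation but performs no allocation where the owned version would call `to_owned`.

```rust
// Owned version: constructing it from a &str requires an allocation.
struct OwnedMessage {
    text: String,
}

// Borrowed version: no allocation, but the struct is now tied to the
// lifetime of the data it points at.
struct BorrowedMessage<'a> {
    text: &'a str,
}

fn main() {
    let s = String::from("hello");
    let owned = OwnedMessage { text: s.to_owned() }; // allocates
    let borrowed = BorrowedMessage { text: &s };     // does not allocate
    assert_eq!(owned.text, borrowed.text);
}
```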
Reusing Collections
Sometimes you need to build up a collection such as a `Vec` in stages. It is usually better to do this by modifying a single `Vec` than by building multiple `Vec`s and then combining them.
For example, if you have a function `do_stuff` that produces a `Vec` and might be called multiple times:

```rust
fn do_stuff(x: u32, y: u32) -> Vec<u32> {
    vec![x, y]
}
```

It might be better to instead modify a passed-in `Vec`:

```rust
fn do_stuff(x: u32, y: u32, vec: &mut Vec<u32>) {
    vec.push(x);
    vec.push(y);
}
```
Sometimes it is worth keeping around a “workhorse” collection that can be reused. For example, if a `Vec` is needed for each iteration of a loop, you could declare the `Vec` outside the loop, use it within the loop body, and then call `clear` at the end of the loop body (to empty the `Vec` without affecting its capacity). This avoids allocations at the cost of obscuring the fact that each iteration’s usage of the `Vec` is unrelated to the others.
Example 1, Example 2.
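The loop pattern can be sketched as follows: `clear` resets the length but keeps the capacity, so after the first few iterations the buffer is large enough and no further allocations occur.

```rust
fn main() {
    // Declared once, outside the loop.
    let mut buf: Vec<u32> = Vec::new();
    for i in 0..10u32 {
        buf.clear(); // empties the Vec without shrinking its capacity
        buf.extend(0..i);
        // ... use `buf` for this iteration's work ...
        assert_eq!(buf.len(), i as usize);
    }
    // The capacity from earlier iterations was retained and reused.
    assert!(buf.capacity() >= 9);
}
```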
Similarly, it is sometimes worth keeping a “workhorse” collection within a struct, to be reused in one or more methods that are called repeatedly.
Using an Alternative Allocator
Another option for improving the performance of allocation-heavy Rust programs is to replace the default (system) allocator with an alternative allocator. The exact effect will depend on the individual program and the alternative allocator chosen, but large improvements in speed and very large reductions in memory usage have been seen in practice. The effect will also vary across platforms, because each platform’s system allocator has its own strengths and weaknesses. The use of an alternative allocator can also affect binary size.
One popular alternative allocator for Linux and Mac is jemalloc, usable via the `tikv-jemallocator` crate. To use it, add a dependency to your `Cargo.toml` file:

```toml
[dependencies]
tikv-jemallocator = "0.4.0"
```

Then add the following somewhere in your Rust code:

```rust
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```
`tikv-jemallocator` is a fork of the `jemallocator` crate. As of December 2021, `tikv-jemallocator` uses a jemalloc version that is newer and has better performance than the jemalloc version used by `jemallocator`.
Another alternative allocator that works on many platforms is mimalloc, usable via the `mimalloc` crate.
Avoiding Regressions
To ensure the number and/or size of allocations done by your code doesn’t increase unintentionally, you can use the heap usage testing feature of dhat-rs to write tests that check particular code snippets allocate the expected amount of heap memory.