堆分配

堆分配不总是昂贵的,具体情况取决于使用了哪种分配器。但通常每次分配/释放都涉及获取全局锁, 做一些复杂的数据结构操作,甚至可能需要执行一次系统调用。小的分配也不一定比大的分配廉价。 我们应该知道哪些Rust数据结构(的哪些操作)可能导致堆分配,因为消除堆分配可以很大提升性能。

Rust Container Cheat Sheet有可视化的Rust常用类型,是对接下来的章节很好的补充。

性能分析

如果性能分析显示 mallocfree以及相关的函数为热点,你可能需要尝试减少分配率,或者换用其他分配器。

DHAT is an excellent profiler to use when reducing allocation rates. It works on Linux and some other Unixes. It precisely identifies hot allocation sites and their allocation rates. Exact results will vary, but experience with rustc has shown that reducing allocation rates by 10 allocations per million instructions executed can have measurable performance improvements (e.g. ~1%).

Here is some example output from DHAT.

AP 1.1/25 (2 children) {
  Total:     54,533,440 bytes (4.02%, 2,714.28/Minstr) in 458,839 blocks (7.72%, 22.84/Minstr), avg size 118.85 bytes, avg lifetime 1,127,259,403.64 instrs (5.61% of program duration)
  At t-gmax: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
  At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
  Reads:     15,993,012 bytes (0.29%, 796.02/Minstr), 0.29/byte
  Writes:    20,974,752 bytes (1.03%, 1,043.97/Minstr), 0.38/byte
  Allocated at {
    #1: 0x95CACC9: alloc (alloc.rs:72)
    #2: 0x95CACC9: alloc (alloc.rs:148)
    #3: 0x95CACC9: reserve_internal<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:669)
    #4: 0x95CACC9: reserve<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:492)
    #5: 0x95CACC9: reserve<syntax::tokenstream::TokenStream> (vec.rs:460)
    #6: 0x95CACC9: push<syntax::tokenstream::TokenStream> (vec.rs:989)
    #7: 0x95CACC9: parse_token_trees_until_close_delim (tokentrees.rs:27)
    #8: 0x95CACC9: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
  }
}

It is beyond the scope of this book to describe everything in this example, but it should be clear that DHAT gives a wealth of information about allocations, such as where and how often they happen, how big they are, how long they live for, and how often they are accessed.

Box

Box is the simplest heap-allocated type. A Box<T> value is a T value that is allocated on the heap.

It is sometimes worth boxing one or more fields in a struct or enum fields to make a type smaller. (See the Type Sizes chapter for more about this.)

Other than that, Box is straightforward and does not offer much scope for optimizations.

Rc/Arc

Rc/Arc are similar to Box, but the value on the heap is accompanied by two reference counts. They allow value sharing, which can be an effective way to reduce memory usage.

However, if used for values that are rarely shared, they can increase allocation rates by heap allocating values that might otherwise not be heap-allocated. Example.

Unlike Box, calling clone on an Rc/Arc value does not involve an allocation. Instead, it merely increments a reference count.

Vec

Vec is a heap-allocated type with a great deal of scope for optimizing the number of allocations, and/or minimizing the amount of wasted space. To do this requires understanding how its elements are stored.

A Vec contains three words: a length, a capacity, and a pointer. The pointer will point to heap-allocated memory if the capacity is nonzero and the element size is nonzero; otherwise, it will not point to allocated memory.

Even if the Vec itself is not heap-allocated, the elements (if present and nonzero-sized) always will be. If nonzero-sized elements are present, the memory holding those elements may be larger than necessary, providing space for additional future elements. The number of elements present is the length, and the number of elements that could be held without reallocating is the capacity.

When the vector needs to grow beyond its current capacity, the elements will be copied into a larger heap allocation, and the old heap allocation will be freed.

Vec growth

A new, empty Vec created by the common means (vec![] or Vec::new or Vec::default) has a length and capacity of zero, and no heap allocation is required. If you repeatedly push individual elements onto the end of the Vec, it will periodically reallocate. The growth strategy is not specified, but at the time of writing it uses a quasi-doubling strategy resulting in the following capacities: 0, 4, 8, 16, 32, 64, and so on. (It skips directly from 0 to 4, instead of going via 1 and 2, because this avoids many allocations in practice.) As a vector grows, the frequency of reallocations will decrease exponentially, but the amount of possibly-wasted excess capacity will increase exponentially.

This growth strategy is typical for growable data structures and reasonable in the general case, but if you know in advance the likely length of a vector you can often do better. If you have a hot vector allocation site (e.g. a hot Vec::push call), it is worth using eprintln! to print the vector length at that site and then doing some post-processing (e.g. with counts) to determine the length distribution. For example, you might have many short vectors, or you might have a smaller number of very long vectors, and the best way to optimize the allocation site will vary accordingly.

Short Vecs

If you have many short vectors, you can use the SmallVec type from the smallvec crate. SmallVec<[T; N]> is a drop-in replacement for Vec that can store N elements within the SmallVec itself, and then switches to a heap allocation if the number of elements exceeds that. (Note also that vec![] literals must be replaced with smallvec![] literals.) Example 1, Example 2.

SmallVec reliably reduces the allocation rate when used appropriately, but its use does not guarantee improved performance. It is slightly slower than Vec for normal operations because it must always check if the elements are heap-allocated or not. Also, If N is high or T is large, then the SmallVec<[T; N]> itself can be larger than Vec<T>, and copying of SmallVec values will be slower. As always, benchmarking is required to confirm that an optimization is effective.

If you have many short vectors and you precisely know their maximum length, ArrayVec from the arrayvec crate is a better choice than SmallVec. It does not require the fallback to heap allocation, which makes it a little faster. Example.

Longer Vecs

If you know the minimum or exact size of a vector, you can reserve a specific capacity with Vec::with_capacity, Vec::reserve, or Vec::reserve_exact. For example, if you know a vector will grow to have at least 20 elements, these functions can immediately provide a vector with a capacity of at least 20 using a single allocation, whereas pushing the items one at a time would result in four allocations (for capacities of 4, 8, 16, and 32). Example.

If you know the maximum length of a vector, the above functions also let you not allocate excess space unnecessarily. Similarly, Vec::shrink_to_fit can be used to minimize wasted space, but note that it may cause a reallocation.

String

A String contains heap-allocated bytes. The representation and operation of String are very similar to that of Vec<u8>. Many Vec methods relating to growth and capacity have equivalents for String, such as String::with_capacity.

The SmallString type from the smallstr crate is similar to the SmallVec type.

The String type from the smartstring crate is a drop-in replacement for String that avoids heap allocations for strings with less than three words’ worth of characters. On 64-bit platforms, this is any string that is less than 24 bytes, which includes all strings containing 23 or fewer ASCII characters. Example.

Note that the format! macro produces a String, which means it performs an allocation. If you can avoid a format! call by using a string literal, that will avoid this allocation. Example. std::format_args and/or the lazy_format crate may help with this.

Hash tables

HashSet and HashMap are hash tables. Their representation and operations are similar to those of Vec, in terms of allocations: they have a single contiguous heap allocation, holding keys and values, which is reallocated as necessary as the table grows. Many Vec methods relating to growth and capacity have equivalents for HashSet/HashMap, such as HashSet::with_capacity.

Cow

Sometimes you have some borrowed data, such as a &str, that is mostly read-only but occasionally needs to be modified. Cloning the data every time would be wasteful. Instead you can use “clone-on-write” semantics via the Cow type, which can represent both borrowed and owned data.

Typically, when starting with a borrowed value x you wrap it in a Cow with Cow::Borrowed(x). Because Cow implements Deref, you can call non-mutating methods directly on the data it encloses. If mutation is desired, Cow::to_mut will obtain a mutable reference to an owned value, cloning if necessary.

Cow can be fiddly to get working, but it is often worth the effort. Example 1, Example 2, Example 3, Example 4.

clone

Calling clone on a value that contains heap-allocated memory typically involves additional allocations. For example, calling clone on a non-empty Vec requires a new allocation for the elements (but note that the capacity of the new Vec might not be the same as the capacity of the original Vec). The exception is Rc/Arc, where a clone call just increments the reference count.

clone_from is an alternative to clone. a.clone_from(&b) is equivalent to a = b.clone() but may avoid unnecessary allocations. For example, if you want to clone one Vec over the top of an existing Vec, the existing Vec’s heap allocation will be reused if possible, as the following example shows.

#![allow(unused)]
fn main() {
let mut v1: Vec<u32> = Vec::with_capacity(99);
let v2: Vec<u32> = vec![1, 2, 3];
v1.clone_from(&v2); // v1's allocation is reused
assert_eq!(v1.capacity(), 99);
}

Although clone usually causes allocations, it is a reasonable thing to use in many circumstances and can often make code simpler. Use profiling data to see which clone calls are hot and worth taking the effort to avoid.

Sometimes Rust code ends up containing unnecessary clone calls, due to (a) programmer error, or (b) changes in the code that render previously-necessary clone calls unnecessary. If you see a hot clone call that does not seem necessary, sometimes it can simply be removed. Example 1, Example 2, Example 3.

to_owned

ToOwned::to_owned is implemented for many common types. It creates owned data from borrowed data, usually by cloning, and therefore often causes heap allocations. For example, it can be used to create a String from a &str.

Sometimes to_owned calls can be avoided by storing a reference to borrowed data in a struct rather than an owned copy. This requires lifetime annotations on the struct, complicating the code, and should only be done when profiling and benchmarking shows that it is worthwhile. Example.

Reusing Collections

Sometimes you need to build up a collection such as a Vec in stages. It is usually better to do this by modifying a single Vec than by building multiple Vecs and then combining them.

For example, if you have a function do_stuff that produces a Vec that might be called multiple times:

#![allow(unused)]
fn main() {
fn do_stuff(x: u32, y: u32) -> Vec<u32> {
    vec![x, y]
}
}

It might be better to instead modify a passed-in Vec:

#![allow(unused)]
fn main() {
fn do_stuff(x: u32, y: u32, vec: &mut Vec<u32>) {
    vec.push(x);
    vec.push(y);
}
}

Sometimes it is worth keeping around a “workhorse” collection that can be reused. For example, if a Vec is needed for each iteration of a loop, you could declare the Vec outside the loop, use it within the loop body, and then call clear at the end of the loop body (to empty the Vec without affecting its capacity). This avoids allocations at the cost of obscuring the fact that each iteration’s usage of the Vec is unrelated to the others. Example 1, Example 2.

Similarly, it is sometimes worth keeping a “workhorse” collection within a struct, to be reused in one or more methods that are called repeatedly.

Using an Alternative Allocator

Another option for improving the performance of allocation-heavy Rust programs is to replace the default (system) allocator with an alternative allocator. The exact effect will depend on the individual program and the alternative allocator chosen, but large improvements in speed and very large reductions in memory usage have been seen in practice. The effect will also vary across platforms, because each platform’s system allocator has its own strengths and weaknesses. The use of an alternative allocator can also affect binary size.

One popular alternative allocator for Linux and Mac is jemalloc, usable via the tikv-jemallocator crate. To use it, add a dependency to your Cargo.toml file:

[dependencies]
tikv-jemallocator = "0.4.0"

Then add the following somewhere in your Rust code:

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

tikv-jemallocator is a fork of the jemallocator crate. As of December 2021, tikv-jemallocator uses a jemalloc version that is newer and has better performance than the jemalloc version used by jemallocator.

Another alternative allocator that works on many platforms is mimalloc, usable via the mimalloc crate.

Avoiding Regressions

To ensure the number and/or size of allocations done by your code doesn’t increase unintentionally, you can use the heap usage testing feature of dhat-rs to write tests that check particular code snippets allocate the expected amount of heap memory.