Understanding Arc in Rust

Tracy · rust, concurrency, atomics

This article was posted to my X account! You can find it here: https://x.com/tracy_codes/status/1796980877093986488

Rust's Arc (Atomically Reference Counted) is a crucial feature for safe concurrency, providing thread-safe reference counting. But what are the intricacies and performance implications? Well, I've been deep in it for a while, so let's share the knowledge. At the end of this article, I've provided some example code and benchmarks for different workloads.

Concurrency & Arc Primer

Concurrency in programming is essential for leveraging multi-core processors, allowing multiple tasks to run simultaneously and improving performance. However, with great power (concurrency) comes great responsibility (safely sharing data between threads). Rust puts an extreme amount of focus on safety and performance, providing several tools to handle concurrency. Arc is one of the most widely used among them.

Arc is designed to enable multiple ownership of data across threads. When you create an Arc, it starts with a reference count of one, indicating a single owner. Cloning an Arc produces a new pointer to the same data, incrementing the reference count atomically. This atomic increment ensures that the operation is safe and visible to all threads, preventing race conditions where multiple threads might attempt to modify the reference count simultaneously.

The reference count in Arc is managed using atomic operations, which are low-level operations provided by the CPU to ensure that certain critical operations, like incrementing or decrementing a reference count, are performed atomically. These operations are essential for maintaining the consistency and integrity of the reference count across multiple threads.

When an Arc is no longer needed, it is dropped, and the reference count is decremented atomically. If the reference count reaches zero, indicating that there are no more owners of the data, the data is de-allocated. This ensures that memory is properly cleaned up, preventing memory leaks and ensuring that resources are efficiently managed.
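
To make the counting concrete, here's a small sketch that uses std's Arc::strong_count to watch the count move as clones are created and dropped:

use std::sync::Arc;

fn main() {
    let data = Arc::new(vec![1, 2, 3]);
    assert_eq!(Arc::strong_count(&data), 1); // a single owner to start

    let clone_a = Arc::clone(&data); // atomic increment
    let clone_b = Arc::clone(&data); // atomic increment
    assert_eq!(Arc::strong_count(&data), 3);

    drop(clone_a); // atomic decrement
    drop(clone_b); // atomic decrement
    assert_eq!(Arc::strong_count(&data), 1);
    // when `data` goes out of scope, the count reaches zero and the Vec is freed
}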

The thread-safety provided by Arc comes at the cost of performance overhead associated with atomic operations. These operations are more expensive than non-atomic operations due to the additional coordination required between CPU cores. However, the safety guarantees they provide make Arc a valuable tool for writing robust and safe concurrent code in Rust.

Arc Internals

Arc<T> is a thread-safe smart pointer enabling multiple ownership of data across threads. The T is a generic type, meaning just about anything is Arc-able. It differs from its counterpart, Rc<T>, in that Arc utilizes atomic operations for reference counting, whereas Rc uses non-atomic reference counting and is therefore not thread-safe.
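
You can actually see this difference at compile time: Rc<T> is not Send, so the compiler refuses to move one into another thread, while the equivalent Arc<T> is accepted. A quick sketch:

use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    let rc = Rc::new(5);
    // Uncommenting the next line fails to compile: `Rc<i32>` cannot be
    // sent between threads safely (it is not `Send`).
    // thread::spawn(move || println!("{}", rc));
    drop(rc);

    let arc = Arc::new(5);
    thread::spawn(move || println!("{}", arc)).join().unwrap(); // fine
}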

Arc utilizes the atomic operations fetch_add to increment and fetch_sub to decrement its reference counter. These CPU-level instructions provide guarantees about visibility and ordering of changes across threads.
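
These same operations are available directly on the atomic integer types, so here's a minimal sketch of fetch_add and fetch_sub outside of Arc (using SeqCst ordering for simplicity):

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let counter = AtomicUsize::new(1);

    // fetch_add increments atomically and returns the previous value.
    let before = counter.fetch_add(1, Ordering::SeqCst);
    assert_eq!(before, 1);
    assert_eq!(counter.load(Ordering::SeqCst), 2);

    // fetch_sub decrements atomically and returns the previous value.
    let before = counter.fetch_sub(1, Ordering::SeqCst);
    assert_eq!(before, 2);
    assert_eq!(counter.load(Ordering::SeqCst), 1);
}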

The internal reference counter within Arc is stored in an atomic integer, typically an AtomicUsize. When an Arc<T> is cloned, the AtomicUsize is incremented with the CPU instructions above. Conversely, when dropped, the AtomicUsize is decremented. Once it reaches zero, the data and reference count are both de-allocated.

Regarding memory ordering, Arc's atomic operations come with specific guarantees. It uses Acquire and Release semantics to ensure that operations on the reference count are properly synchronized with other memory operations. This ensures that any changes made to the data before the reference count is incremented are visible to any thread that subsequently observes the incremented count.
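
To illustrate that choreography, below is a deliberately simplified toy version of Arc. This is a sketch, not the real std implementation (which also handles weak references, count overflow, and more), and the ToyArc/Inner names are made up. One nuance worth calling out: the increment on clone can be Relaxed, since an existing handle already keeps the data alive; the ordering that matters is the Release on the decrement, paired with an Acquire fence before deallocation.

use std::ops::Deref;
use std::ptr::NonNull;
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::thread;

// The heap allocation shared by every handle: the count plus the data.
struct Inner<T> {
    count: AtomicUsize,
    data: T,
}

struct ToyArc<T> {
    inner: NonNull<Inner<T>>,
}

// Safe to share across threads only when T itself is Send + Sync.
unsafe impl<T: Send + Sync> Send for ToyArc<T> {}
unsafe impl<T: Send + Sync> Sync for ToyArc<T> {}

impl<T> ToyArc<T> {
    fn new(data: T) -> Self {
        let inner = Box::new(Inner {
            count: AtomicUsize::new(1), // one owner to start
            data,
        });
        ToyArc { inner: NonNull::from(Box::leak(inner)) }
    }
}

impl<T> Clone for ToyArc<T> {
    fn clone(&self) -> Self {
        // Relaxed is enough here: an existing handle already keeps the
        // allocation alive, so no extra ordering is needed to add one.
        unsafe { self.inner.as_ref() }
            .count
            .fetch_add(1, Ordering::Relaxed);
        ToyArc { inner: self.inner }
    }
}

impl<T> Deref for ToyArc<T> {
    type Target = T;
    fn deref(&self) -> &T {
        let inner = unsafe { self.inner.as_ref() };
        &inner.data
    }
}

impl<T> Drop for ToyArc<T> {
    fn drop(&mut self) {
        // Release publishes this thread's prior writes; if we were the
        // last owner, the Acquire fence makes every thread's writes
        // visible before the data is freed.
        let old = unsafe { self.inner.as_ref() }
            .count
            .fetch_sub(1, Ordering::Release);
        if old == 1 {
            fence(Ordering::Acquire);
            unsafe { drop(Box::from_raw(self.inner.as_ptr())) };
        }
    }
}

fn main() {
    let a = ToyArc::new(String::from("shared"));
    let b = a.clone();
    thread::spawn(move || println!("thread sees: {}", *b))
        .join()
        .unwrap();
    println!("main sees: {}", *a);
}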

Arc Tradeoffs & Performance Implications

The key tradeoff with Arc<T> is the performance cost of atomic operations. The internal atomic operations, fetch_add and fetch_sub, are more expensive than their non-atomic counterparts. These operations often require special CPU instructions that can be slower than regular memory accesses. This overhead is particularly noticeable in scenarios with frequent cloning and dropping of Arc<T>, where the reference count is updated often.

Since cloning an Arc<T> involves an atomic increment of the reference count, the cost to clone an Arc can be understood as being proportional to the frequency of cloning operations in your program. In read-heavy workloads where data is rarely modified but frequently read and shared across threads, this cost can be amortized over many reads.
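
As a rough, unscientific illustration of that cost, the sketch below times non-atomic Rc clones against atomic Arc clones in a tight loop. Treat the numbers as directional only; for real measurements, reach for cargo bench or a proper harness:

use std::hint::black_box;
use std::rc::Rc;
use std::sync::Arc;
use std::time::Instant;

fn main() {
    const N: usize = 10_000_000;

    let rc = Rc::new(0u64);
    let start = Instant::now();
    for _ in 0..N {
        black_box(Rc::clone(&rc)); // plain (non-atomic) increment + decrement
    }
    println!("Rc clone+drop:  {:?}", start.elapsed());

    let arc = Arc::new(0u64);
    let start = Instant::now();
    for _ in 0..N {
        black_box(Arc::clone(&arc)); // atomic increment + decrement
    }
    println!("Arc clone+drop: {:?}", start.elapsed());
}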

When an Arc<T> is fully dropped, an atomic decrement operation is performed on the reference count. If the reference count reaches zero, the data is de-allocated. The final drop involves not only the decrement operation but also the potential de-allocation of the data, which can be a complex operation, especially if it involves freeing large or complex structures. The performance impact of this de-allocation can be significant in some cases, particularly in write-heavy or allocation-heavy workloads.

Atomic operations can also lead to cache contention. When multiple threads frequently update the same atomic variable (the reference count in this case), it can cause cache lines to bounce between CPU cores, leading to increased latency and reduced throughput. This cache-line "ping-ponging", a close cousin of false sharing (where unrelated variables happen to occupy the same cache line), can degrade performance in highly concurrent systems.

The final tradeoff we'll cover is interior mutability. Arc<T> itself does not provide mutability. To mutate the underlying data, T in this instance, additional synchronization primitives like Mutex<T> (mutual exclusion) or RwLock<T> (reader-writer lock) are required. For example, Arc<Mutex<T>> allows for safe, mutable access to the data. However, this combination introduces further overhead due to the locking mechanisms, which can impact performance. The need for interior mutability complicates the design and introduces additional synchronization costs, which must be carefully managed to mitigate bottlenecks.

Arc Use Cases

So far, Arc<T> sounds pretty scary, right? Well, there are use cases where this structure is favorable. Arc<T> is well-suited for scenarios where data is read frequently and shared across multiple threads but rarely modified. You can consider this a "high read, low write" workload. Below are some examples.

  1. Immutable data structures: When you have a large, immutable data structure that needs to be accessed by multiple threads, Arc<T> is an ideal choice. Since the data is immutable, there's no need for additional synchronization mechanisms to protect it, and the overhead of atomic reference counting is relatively low compared to the benefits of safe concurrent access.
  2. Configuration data: In many applications, configuration data is loaded at startup and remains constant throughout the application's lifetime. Using Arc<T> allows multiple threads to access the configuration data without the need for locks or other synchronization mechanisms, ensuring high-performance thread safety.
  3. Read-heavy workloads: Applications like a web server that serves static content or a search engine that performs many read queries can use Arc<T> to share data efficiently across threads, allowing them to serve more concurrent end-user requests.

Pro Tips

Always profile your application to understand the impact of Arc<T>. Profiling helps identify bottlenecks and areas where the overhead of atomic operations might be significant. Tools like perf, valgrind, or Rust's built-in cargo bench can provide insights into how Arc<T> affects your application's performance.

When designing data structures that will be shared using Arc<T>, consider the memory layout and access patterns. Ensure that frequently accessed data is cache-friendly and minimize the number of atomic operations required. For example, instead of sharing a large monolithic structure, consider breaking it into smaller, independently shareable pieces.
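
For example, a hypothetical server state (the field names below are invented for illustration) could be shared as independent Arcs rather than one monolith, so a worker that only needs one piece never touches the other's reference count:

use std::sync::Arc;

// One Arc over everything: every clone touches the same counter,
// even if a thread only needs the routing table.
#[allow(dead_code)]
struct MonolithicState {
    routing_table: Vec<String>,
    feature_flags: Vec<bool>,
}

// Independently shareable pieces: threads clone only what they use.
#[derive(Clone)]
struct SharedState {
    routing_table: Arc<Vec<String>>,
    feature_flags: Arc<Vec<bool>>,
}

fn main() {
    let state = SharedState {
        routing_table: Arc::new(vec!["/home".into(), "/about".into()]),
        feature_flags: Arc::new(vec![true, false]),
    };

    // A worker that only needs routing clones just that Arc.
    let routes = Arc::clone(&state.routing_table);
    std::thread::spawn(move || println!("routes: {:?}", routes))
        .join()
        .unwrap();
}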

In the case of read-heavy workloads, Arc<RwLock<T>> can be more efficient than Arc<Mutex<T>> because it allows multiple readers to access the data simultaneously. However, be mindful of potential writer starvation if there are many readers. Profiling can help determine the best synchronization primitive for your specific workload.

You may want to consider other concurrency patterns that might be more efficient for your specific workload.

  1. Channels: Channels enable a different concurrency pattern called message passing, where one thread sends a message containing some data over a channel to another thread. There are many channel implementations for Rust, including std::sync::mpsc, Rust's built-in multi-producer, single-consumer channel. This lets one or more threads communicate with another thread by sending messages, avoiding the need for shared mutable state and the associated synchronization overhead. If your application can be structured around message passing, channels can provide a simpler and more performant concurrency model.
  2. Thread local storage: If the data is only accessed by a single thread at a time, consider using thread-local storage (TLS). Rust's std::thread::LocalKey provides a way to store data local to a thread, avoiding the need for atomic operations entirely. TLS is particularly useful for thread-specific caches or context data (see the sketch after this list).
  3. Lock-free data structures: For certain use cases, lock-free data structures can provide better performance than Arc<T> combined with locks. Libraries like crossbeam offer lock-free queues and other data structures that can reduce contention and improve throughput in highly concurrent environments.
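
Here's the thread-local storage sketch referenced in item 2: the std thread_local! macro creates a LocalKey, and each thread gets its own independent copy of the value, so no atomics or locks are involved:

use std::cell::RefCell;
use std::thread;

thread_local! {
    // Every thread gets its own independent counter.
    static LOCAL_COUNTER: RefCell<u64> = RefCell::new(0);
}

fn main() {
    let handles: Vec<_> = (0..3)
        .map(|_| {
            thread::spawn(|| {
                LOCAL_COUNTER.with(|c| {
                    *c.borrow_mut() += 1; // no locks, no atomics
                    println!("this thread's count: {}", c.borrow());
                });
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}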

Code Examples

I don't know about you, but I'm a visual learner. I love me some examples. So below are examples of what we just discussed.

Basic example: sharing immutable data across threads

use std::sync::Arc;
use std::thread;
 
fn main() {
    let data = Arc::new(vec![1, 2, 3, 4, 5]);
 
    let mut handles = vec![];
 
    for i in 0..5 {
        let data_clone = Arc::clone(&data);
        let handle = thread::spawn(move || {
            println!("Thread {} sees: {:?}", i, data_clone);
        });
        handles.push(handle);
    }
 
    for handle in handles {
        handle.join().unwrap();
    }
}

Using Arc with Mutex for mutable data

use std::sync::{Arc, Mutex};
use std::thread;
 
fn main() {
    let data = Arc::new(Mutex::new(0));
 
    let mut handles = vec![];
 
    for _ in 0..10 {
        let data_clone = Arc::clone(&data);
        let handle = thread::spawn(move || {
            let mut num = data_clone.lock().unwrap(); // blocks until the lock is free
            *num += 1;
        });
        handles.push(handle);
    }
 
    for handle in handles {
        handle.join().unwrap();
    }
 
    println!("Result: {}", *data.lock().unwrap());
}

Using Arc with RwLock for read-heavy workloads

use std::sync::{Arc, RwLock};
use std::thread;
 
fn main() {
    let data = Arc::new(RwLock::new(0));
 
    let mut handles = vec![];
 
    for _ in 0..10 {
        let data_clone = Arc::clone(&data);
        let handle = thread::spawn(move || {
            let read_guard = data_clone.read().unwrap(); // shared: many readers at once
            println!("Read: {}", *read_guard);
        });
        handles.push(handle);
    }
 
    for _ in 0..2 {
        let data_clone = Arc::clone(&data);
        let handle = thread::spawn(move || {
            let mut write_guard = data_clone.write().unwrap(); // exclusive: waits for readers
            *write_guard += 1;
        });
        handles.push(handle);
    }
 
    for handle in handles {
        handle.join().unwrap();
    }
 
    println!("Final value: {}", *data.read().unwrap());
}

Using Arc in a real-world scenario: shared configuration data

use std::sync::Arc;
use std::thread;
 
#[derive(Debug)]
struct Config {
    name: String,
    version: u32,
}
 
fn main() {
    let config = Arc::new(Config {
        name: "MyApp".to_string(),
        version: 1,
    });
 
    let mut handles = vec![];
 
    for i in 0..5 {
        let config_clone = Arc::clone(&config);
        let handle = thread::spawn(move || {
            println!("Thread {} sees config: {:?}", i, config_clone);
        });
        handles.push(handle);
    }
 
    for handle in handles {
        handle.join().unwrap();
    }
}

Basic example of using channels for message passing between threads

use std::sync::mpsc;
use std::thread;
 
fn main() {
    let (tx, rx) = mpsc::channel();
 
    for i in 0..5 {
        let tx_clone = tx.clone();
        thread::spawn(move || {
            tx_clone.send(i).unwrap(); // moves i over the channel to the receiver
        });
    }
 
    for _ in 0..5 {
        println!("Received: {}", rx.recv().unwrap());
    }
}

Benchmark results

These benchmark results cover the following workloads:

  1. Read-heavy over 10 operations
  2. Write-heavy over 10 operations
  3. Mixed read + write over 10 operations

Each benchmark run introduces a 25ms delay for every other (evenly-distributed, not random) loop iteration. This introduces some simulated waiting for each benchmark to show what happens when some "computation" is happening with these concurrency primitives.

They were obtained on an i9-10900K @ ~4.75GHz and Rust 1.78.0. You can go here to view the benchmark code yourself.

Read-heavy

  1. Arc<Mutex>: 126.15ms
  2. Arc<RwLock>: 25.479ms
  3. MPSC Channel: 50.954ms

Write-heavy

  1. Arc<Mutex>: 126.15ms
  2. Arc<RwLock>: 126.15ms
  3. MPSC Channel: 50.977ms

Mixed read + write

  1. Arc<Mutex>: 126.15ms
  2. Arc<RwLock>: 50.670ms
  3. MPSC Channel: 25.545ms

These benchmarks do not dive into the specifics of memory and CPU usage; they're just meant to give a basic overview of the differences across generalized workloads.

Summary

Rust's Arc is a crucial feature for enabling safe, concurrent access to shared data across threads. Unlike Rc, Arc uses atomic operations to manage reference counting, ensuring thread safety but introducing some performance overhead due to the cost of atomic operations.

It's important to consider your specific use case and the tradeoffs that come with how you share data across threads.

Stay tuned for future articles that dive into alternatives to Arc and MPSC for concurrent computing in Rust.