An adventure in SIMD 2
February 27, 2020
For those of you who may have been curious about how to get started with SIMD in Rust after reading the previous post, I thought I would offer a practical example. There have been various attempts at building wrappers around the low-level types and instructions that Rust offers for SIMD, but I’ll be using the standard library in the examples. I think learning to use these will only make using another library easier if you choose to do so later down the line.
SIMD Instruction sets
SIMD comes in a few flavours, the most basic being MMX, followed by SSE and finally the more modern AVX variants. Each generation of SIMD provides larger vector sizes. Today we will be focusing on SSE instructions as they have the 128-bit vectors I discussed in the previous post.
Intel has a handy website for looking up instructions and what variant they fall into. Not all of the instructions are available in Rust but it will serve as a good reference point.
Importing the instructions
The first thing we want to do is bring everything into scope
use std::arch::x86_64::*;
One very important note before we get started: to use SIMD we must use unsafe Rust. That’s because we are accessing low-level features where it’s easy to make mistakes. Please assume all instructions listed in the code snippets have been wrapped in an unsafe block.
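Since the snippets in this post are fragments, here is a minimal sketch of what one looks like once it is wrapped up as a complete function. It assumes an x86_64 target, and the helper name lowest_lane_of_sum is made up for illustration. The _mm_cvtss_f32 intrinsic, which is not covered elsewhere in the post, simply returns the lowest lane as an f32.

```rust
use std::arch::x86_64::*;

/// Adds two numbers with SSE and returns the lowest lane of the result.
/// `lowest_lane_of_sum` is a hypothetical helper name, not from the post.
fn lowest_lane_of_sum(x: f32, y: f32) -> f32 {
    // SAFETY: SSE is part of the x86_64 baseline and no memory is accessed.
    unsafe {
        let a = _mm_set1_ps(x); // every lane holds x
        let b = _mm_set1_ps(y); // every lane holds y
        _mm_cvtss_f32(_mm_add_ps(a, b)) // lowest lane of a + b
    }
}
```

Calling lowest_lane_of_sum(1.5, 2.0) from safe code returns 3.5. Keeping the unsafe block inside a small safe function like this is one way to stop it spreading through the rest of a program.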
Instruction format
The instructions can most often be broken down into three parts:
_prefix_
operation
_type
The _prefix_ can largely be ignored. It’s usually _mm_, which stands for multimedia, a remnant of the past when SIMD was used predominantly for increasing performance in multimedia applications.
The operation section lets us know what action we want to perform. This is where our Intel guide comes in handy. The guide also provides a short description of each of the instructions. While it may appear obvious what an operation is doing, some instructions are very similarly named. My advice would be to always read the description before you decide to use one.
Finally, the _type section tells you what numerical type the operation is expecting. In our case, ps (packed single-precision) is for f32 and will be the type we stick with throughout the post. For i32s, for example, you would look for instructions with the type _epi32.
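To make the naming scheme concrete, the sketch below runs the same add operation with three different type suffixes. It assumes an x86_64 target. The function name add_by_type is hypothetical, and the pd suffix (packed double-precision, f64) is an extra that this post does not otherwise use.

```rust
use std::arch::x86_64::*;

/// Computes x + y in every lane for three element types and returns the
/// lowest lane of each result. `add_by_type` is a hypothetical helper name.
fn add_by_type(x: i32, y: i32) -> (f32, f64, i32) {
    // SAFETY: SSE and SSE2 are part of the x86_64 baseline.
    unsafe {
        // _mm_ + add + _ps: four f32 lanes
        let as_ps = _mm_add_ps(_mm_set1_ps(x as f32), _mm_set1_ps(y as f32));
        // _mm_ + add + _pd: two f64 lanes
        let as_pd = _mm_add_pd(_mm_set1_pd(x as f64), _mm_set1_pd(y as f64));
        // _mm_ + add + _epi32: four i32 lanes
        let as_epi32 = _mm_add_epi32(_mm_set1_epi32(x), _mm_set1_epi32(y));
        (
            _mm_cvtss_f32(as_ps),
            _mm_cvtsd_f64(as_pd),
            _mm_cvtsi128_si32(as_epi32),
        )
    }
}
```

The operation stays the same in all three calls and only the suffix, and with it the vector type, changes.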
Initialising a SIMD vector
There are a few ways to initialise a SIMD vector. We’ll start by explicitly declaring the elements of a 128-bit floating-point vector.
let first_vector = _mm_set_ps(3.0, 2.0, 1.0, 0.0);
dbg!(first_vector); // __m128(0.0, 1.0, 2.0, 3.0,)
The first thing to notice is that the elements are declared in reverse order. Below is an extract from the Rust documentation:
pub unsafe fn _mm_set_ps(a: f32, b: f32, c: f32, d: f32) -> __m128
Note that a will be the highest 32 bits of the result, and d the lowest. This matches the standard way of writing bit patterns on x86.
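If the reversed argument order feels awkward, the standard library also exposes _mm_setr_ps, which takes its arguments from lowest to highest. The sketch below, assuming an x86_64 target and using the hypothetical helpers to_array and set_order, builds the same vector both ways and copies each out to an array for comparison.

```rust
use std::arch::x86_64::*;

/// Copies a SIMD vector into an array, lowest lane first.
fn to_array(v: __m128) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // SAFETY: `out` has room for exactly four f32 values.
    unsafe { _mm_storeu_ps(out.as_mut_ptr(), v) };
    out
}

/// Builds the same vector with both argument orders.
fn set_order() -> ([f32; 4], [f32; 4]) {
    // SAFETY: SSE is part of the x86_64 baseline.
    unsafe {
        let high_to_low = _mm_set_ps(3.0, 2.0, 1.0, 0.0); // a is the highest lane
        let low_to_high = _mm_setr_ps(0.0, 1.0, 2.0, 3.0); // first argument is the lowest lane
        (to_array(high_to_low), to_array(low_to_high))
    }
}
```

Both arrays come back as [0.0, 1.0, 2.0, 3.0], so the two calls describe the same vector.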
Next, we need to create another vector to carry out a SIMD operation. This time we’ll use another method of initialisation.
let increment = _mm_set1_ps(2.5);
Unlike the previous method, _mm_set1_ps takes a single value and copies it into every lane, so we’ve created a SIMD vector with all 4 elements set to 2.5.
Next, let’s add the two vectors together using the add instruction.
let result = _mm_add_ps(first_vector, increment);
dbg!(result); // __m128(2.5, 3.5, 4.5, 5.5,)
Here we have an add operation on the type ps, or f32, as noted earlier. After carrying out the add operation, we’re returned another SIMD vector. Although the elements were declared in reverse, the output lists them from lowest to highest.
Reading from an array slice
In most cases, the array would come from another input source, so we need a method of reading the data and converting it into a SIMD vector.
let data = vec![-0.7, 3.2, 5.1, 0.5, 1.2, 30.0, -13.3, 9.0];
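Provided the data holds at least 4 elements and starts at a 16-byte aligned address, we can use the load instruction _mm_load_ps. A Vec of f32 only guarantees 4-byte alignment, so this relies on the allocator handing back a 16-byte aligned buffer, which is common on 64-bit platforms but not guaranteed.
This operation takes the first 128 bits and loads them into a __m128 SIMD vector.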
let first_block = _mm_load_ps(&data[0]);
dbg!(first_block); // __m128(-0.7, 3.2, 5.1, 0.5,)
If you wanted to start loading from the second element, you might try the following.
let next_block = _mm_load_ps(&data[1]);
// Causes a runtime error.
But what you would be presented with is a segmentation fault. This is one of the reasons that we have to use unsafe Rust: safe Rust would prevent this from happening, but by using unsafe we have opted out of this and other guards.
The reason for the segmentation fault lies in the requirements of the instruction we used. The load instruction expects a pointer to data aligned on a 128-bit (16-byte) boundary. Assuming the array itself starts on such a boundary, that translates into index 0 and 4, so in order to load from the second element you would need the _mm_loadu_ps instruction, which has no alignment requirement.
let last_block = _mm_loadu_ps(&data[1]);
dbg!(last_block); // __m128(3.2, 5.1, 0.5, 1.2,);
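One way to stop relying on where the allocator happens to place the data is to force the alignment yourself. The sketch below assumes an x86_64 target. The Aligned wrapper and the function names are hypothetical. It performs an aligned load from index 0 and an unaligned load from index 1.

```rust
use std::arch::x86_64::*;

/// Hypothetical wrapper that forces its contents onto a 16-byte boundary.
#[repr(C, align(16))]
struct Aligned([f32; 8]);

/// Copies a SIMD vector into an array, lowest lane first.
fn to_array(v: __m128) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // SAFETY: `out` has room for exactly four f32 values.
    unsafe { _mm_storeu_ps(out.as_mut_ptr(), v) };
    out
}

fn load_examples() -> ([f32; 4], [f32; 4]) {
    let data = Aligned([-0.7, 3.2, 5.1, 0.5, 1.2, 30.0, -13.3, 9.0]);
    // SAFETY: both loads read four elements that lie inside `data`.
    unsafe {
        // Index 0 sits on a 16-byte boundary, so the aligned load is valid.
        let aligned = _mm_load_ps(data.0.as_ptr());
        // Index 1 is only 4-byte aligned, so the unaligned load is required.
        let unaligned = _mm_loadu_ps(data.0.as_ptr().add(1));
        (to_array(aligned), to_array(unaligned))
    }
}
```

The first array holds elements 0 to 3 and the second holds elements 1 to 4. If in doubt, _mm_loadu_ps works for any address, aligned or not.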
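To ensure that every _mm_load_ps call stays inside the array and on a 128-bit boundary, there’s the chunks_exact function. Unlike chunks, it never yields a final chunk shorter than 4 elements. This still assumes the array itself starts on a 16-byte boundary.
data.chunks_exact(4).for_each(|chunk| {
    let data_chunk = _mm_load_ps(&chunk[0]);
    // Do stuff
});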
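As a fuller sketch of the chunking idea, the function below adds a constant to every element of a slice of any length. It assumes an x86_64 target, and add_to_all is a hypothetical name. It uses the unaligned load and store so that it makes no assumption about where the slice starts, and it handles the leftover elements with ordinary scalar code.

```rust
use std::arch::x86_64::*;

/// Adds `increment` to every element of `data`, four lanes at a time.
fn add_to_all(data: &mut [f32], increment: f32) {
    // chunks_exact_mut only yields full 4-element chunks.
    let mut chunks = data.chunks_exact_mut(4);
    // SAFETY: each chunk is exactly four f32 values, and the unaligned
    // intrinsics have no alignment requirement.
    unsafe {
        let inc = _mm_set1_ps(increment);
        for chunk in &mut chunks {
            let v = _mm_loadu_ps(chunk.as_ptr());
            _mm_storeu_ps(chunk.as_mut_ptr(), _mm_add_ps(v, inc));
        }
    }
    // Fewer than four elements may be left over; handle them one by one.
    for x in chunks.into_remainder() {
        *x += increment;
    }
}
```

With a six-element slice, the first four elements go through the SIMD path and the last two through the scalar loop.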
Bonus
To save you from the headache I experienced while trying to carry out a load on an array of i32s, below is an example.
let data = vec![-7, 3, 5, 0, 1, 30, -13, 9];
let simd = _mm_load_si128(&data[0] as *const _ as *const __m128i);
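You can’t cast a reference directly into the *const __m128i that the load function requires, so you first cast it into a raw pointer and then into a pointer of that type. As with _mm_load_ps, the address must be 16-byte aligned. _mm_loadu_si128 lifts that requirement.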
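Here is a small round trip for integers, sketched under the assumption of an x86_64 target and with the hypothetical function name double_first_four. It uses the unaligned variants _mm_loadu_si128 and _mm_storeu_si128 so the slice can start at any address, and the same pointer cast is needed on the way out.

```rust
use std::arch::x86_64::*;

/// Doubles the first four i32 values of `data` using SSE2.
fn double_first_four(data: &[i32]) -> [i32; 4] {
    assert!(data.len() >= 4, "need at least four elements");
    let mut out = [0i32; 4];
    // SAFETY: the length check above guarantees four readable elements,
    // and `out` has room for four i32 values.
    unsafe {
        // Cast the element pointer into a pointer to the SIMD type.
        let v = _mm_loadu_si128(data.as_ptr() as *const __m128i);
        let doubled = _mm_add_epi32(v, v);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, doubled);
    }
    out
}
```

Passing the data from the example above returns [-14, 6, 10, 0].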
Converting into an array
Eventually, you’ll want to convert the __m128 type back into an array. Here are two possible solutions.
Transmuting
By passing the SIMD vector into the transmute function, we can easily convert the type into a standard array.
let data = _mm_set_ps(3., 2., 1., 0.);
let array = std::mem::transmute::<__m128, [f32; 4]>(data);
dbg!(array); // [0.0, 1.0, 2.0, 3.0]
This only works because the __m128 type is essentially 4 f32s in memory. If you’re unaware of the dangers of using this function, please check out the documentation.
Union
Another option, which is the equivalent of a transmute, uses the union type. You can imagine a union to be a struct whose fields carry out a transmute between the defined types. The Rust documentation lacks details regarding the union type, but the Rust reference has some more detail.
union SimdToArray {
array: [f32; 4],
simd: __m128
}
let s2a = SimdToArray {
simd: _mm_set_ps(3., 2., 1., 0.),
};
dbg!(s2a.array); // [0.0, 1.0, 2.0, 3.0]
The SIMD vector was stored into the simd field but later read from the array field.
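A third route, not one of the two described above, is to write the vector out with the store intrinsic _mm_storeu_ps. The sketch below assumes an x86_64 target and uses hypothetical function names. It places that route next to transmute so the results can be compared.

```rust
use std::arch::x86_64::*;

/// Reinterprets the vector's 16 bytes as four f32 values.
fn via_transmute(v: __m128) -> [f32; 4] {
    // SAFETY: __m128 and [f32; 4] are both 16 bytes of plain f32 data.
    unsafe { std::mem::transmute::<__m128, [f32; 4]>(v) }
}

/// Writes the vector into an array with an unaligned store.
fn via_store(v: __m128) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // SAFETY: `out` has room for exactly four f32 values.
    unsafe { _mm_storeu_ps(out.as_mut_ptr(), v) };
    out
}

/// Converts the same vector both ways.
fn convert_both() -> ([f32; 4], [f32; 4]) {
    // SAFETY: SSE is part of the x86_64 baseline.
    let v = unsafe { _mm_set_ps(3., 2., 1., 0.) };
    (via_transmute(v), via_store(v))
}
```

Both conversions give [0.0, 1.0, 2.0, 3.0]. The store version also works when the destination is part of a larger buffer, since it only needs a pointer to four writable elements.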
Masking
Since we only have a limited amount of space in a SIMD vector, you may find you need to swap values based on a condition. This can be done using the blend instruction if it’s available, but I found this to be slower than the following trick.
let array1 = _mm_set_ps(14., 5., 5., 4.);
//__m128(4., 5., 5., 14.)
let array2 = _mm_set_ps(6., 1., 7., 10.);
//__m128(10., 7., 1., 6.)
let low_mask = _mm_cmplt_ps(array1, array2);
// Mask lanes are (set, set, 0, 0): a set lane has all 32 bits on, not the value 1.0
let result = _mm_or_ps(
_mm_and_ps(array1, low_mask),
// __m128(4., 5., 0., 0.)
_mm_andnot_ps(low_mask, _mm_set1_ps(0.99))
// __m128(0., 0., 0.99, 0.99),
);
dbg!(result); // __m128(4., 5., 0.99, 0.99)
We start by creating two arrays and using the cmplt (compare less than) instruction to generate our mask. The masking trick is then formed of three parts.
_mm_and_ps - Here we AND the mask with array1, keeping the values where the mask is set and zeroing the rest.
_mm_andnot_ps - This inverts the mask, so the last two lanes are now the ones that are set, and ANDs it with the second operand, putting .99 (or any values we wanted) into those lanes.
_mm_or_ps - The first instruction returned (4., 5., 0., 0.) and the second (0., 0., 0.99, 0.99) so this or instruction produces (4., 5., 0.99, 0.99) as the result.
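The three steps can be folded into a small reusable helper. This is a sketch assuming an x86_64 target, and the names select and masking_demo are hypothetical. It also uses _mm_movemask_ps, which is not covered in this post, to pack the top bit of each mask lane into an integer so the mask can be inspected.

```rust
use std::arch::x86_64::*;

/// Per lane: take `if_true` where the mask is set, `if_false` elsewhere.
fn select(mask: __m128, if_true: __m128, if_false: __m128) -> __m128 {
    // SAFETY: SSE is part of the x86_64 baseline and no memory is accessed.
    unsafe {
        _mm_or_ps(
            _mm_and_ps(mask, if_true),     // keep lanes where the mask is set
            _mm_andnot_ps(mask, if_false), // (!mask) & if_false fills the rest
        )
    }
}

/// Reruns the example above and also returns the mask as a bit pattern.
fn masking_demo() -> ([f32; 4], i32) {
    // SAFETY: the store writes four f32 values into a four-element array.
    unsafe {
        let array1 = _mm_set_ps(14., 5., 5., 4.);
        let array2 = _mm_set_ps(6., 1., 7., 10.);
        let low_mask = _mm_cmplt_ps(array1, array2);
        let result = select(low_mask, array1, _mm_set1_ps(0.99));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), result);
        // Bit i of the returned integer is the top bit of mask lane i.
        (out, _mm_movemask_ps(low_mask))
    }
}
```

The result is [4.0, 5.0, 0.99, 0.99] and the mask bits are 0b0011, meaning only the two lowest lanes passed the comparison.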
Conclusion
Hopefully, that gives you enough to get started with SIMD. It can be tricky in Rust as there aren’t a lot of tutorials around, but with a bit of persistence you’ll soon be on your way. Good luck!