An adventure in SIMD 2

February 27, 2020

For those of you who were curious about how to get started with SIMD in Rust after reading the previous post, I thought I would offer a practical example. There have been various attempts at building wrappers around the low-level types and instructions that Rust offers for SIMD, but I’ll be using the standard library in the examples. I think learning to use these will only make using another library easier if you choose to do so later down the line.

SIMD Instruction sets

SIMD comes in a few flavours, the most basic being MMX, followed by SSE and finally the more modern AVX variants. Each generation of SIMD provides larger vector sizes. Today we will be focusing on SSE instructions as they have the 128-bit vectors I discussed in the previous post.
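If you are unsure which of these flavours your own machine supports, the standard library can tell you at runtime. The snippet below is a minimal sketch, assuming an x86_64 target; supported_simd_levels is a hypothetical helper name of my own, while is_x86_feature_detected! is the standard library macro.

```rust
// Hypothetical helper (not from the post): lists which of a few common
// SIMD extensions the CPU running this program reports as available.
fn supported_simd_levels() -> Vec<&'static str> {
    let mut levels = Vec::new();
    // The macro takes the feature name as a string literal.
    if is_x86_feature_detected!("sse") {
        levels.push("sse");
    }
    if is_x86_feature_detected!("sse2") {
        levels.push("sse2");
    }
    if is_x86_feature_detected!("avx") {
        levels.push("avx");
    }
    if is_x86_feature_detected!("avx2") {
        levels.push("avx2");
    }
    levels
}
```

On any x86_64 machine the list should contain at least sse and sse2, since both are part of the baseline for that architecture. Whether avx and avx2 show up depends on the CPU.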

Intel has a handy website for looking up instructions and what variant they fall into. Not all of the instructions are available in Rust but it will serve as a good reference point.

Intel Intrinsics Guide

Importing the instructions

The first thing we want to do is bring everything into scope:

use std::arch::x86_64::*;

One very important note before we get started: to use SIMD we must use unsafe Rust. That’s because we are accessing some low-level features and it’s easy to make mistakes here. Please assume all instructions listed in the code snippets have been wrapped in an unsafe block.
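To make the unsafe requirement concrete, here is a rough sketch of how a snippet might be wrapped so that callers only ever see a safe function. The name double_all is hypothetical, and the sketch assumes an x86_64 target, where SSE is always available. It uses the unaligned load and store instructions so that no alignment assumption is needed.

```rust
use std::arch::x86_64::*;

// Hypothetical helper: doubles four floats with a single SSE addition.
fn double_all(values: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // SAFETY: SSE is part of the x86_64 baseline, and both pointers refer to
    // arrays of exactly four f32 values. The unaligned load/store variants
    // place no alignment requirement on those pointers.
    unsafe {
        let v = _mm_loadu_ps(values.as_ptr());
        let doubled = _mm_add_ps(v, v);
        _mm_storeu_ps(out.as_mut_ptr(), doubled);
    }
    out
}
```

Calling double_all([1.0, 2.5, -3.0, 0.0]) should give [2.0, 5.0, -6.0, 0.0]. Keeping the unsafe block small and documenting why it is sound makes mistakes easier to spot.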

Instruction format

The instructions can most often be broken down into three parts:

_prefix_ operation _type

The _prefix_ can largely be ignored. For the 128-bit instructions it’s usually _mm_, which stands for multimedia (the 256-bit AVX instructions use _mm256_ instead). That’s a remnant of the past when SIMD was used predominantly for increasing performance in multimedia applications.

The operation section lets us know what action we want to perform. This is where our Intel guide comes in handy. The guide also provides a short description of each of the instructions. While it may appear obvious what the operation is doing, some instructions are very similarly named. My advice would be to always look at the descriptions before you decide to use one.

Finally, the _type section tells you what numerical type this operation is expecting. In our case, ps (packed single-precision) is for f32 and will be the type we stick with throughout the post. For i32s, for example, you would look for instructions with the type _epi32.
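As an illustration of the naming scheme, the sketch below performs the same add operation once with the ps suffix and once with the epi32 suffix. The function names add_f32_lanes and add_i32_lanes are hypothetical, and an x86_64 target is assumed.

```rust
use std::arch::x86_64::*;

// Hypothetical helper: `_mm` + `add` + `ps` works on four f32 lanes.
fn add_f32_lanes() -> [f32; 4] {
    unsafe {
        let a = _mm_set1_ps(1.0);
        let b = _mm_set1_ps(1.5);
        let sum = _mm_add_ps(a, b);
        // __m128 and [f32; 4] are both 16 bytes, so the transmute is valid.
        std::mem::transmute::<__m128, [f32; 4]>(sum)
    }
}

// Hypothetical helper: `_mm` + `add` + `epi32` works on four i32 lanes.
fn add_i32_lanes() -> [i32; 4] {
    unsafe {
        let a = _mm_set1_epi32(1);
        let b = _mm_set1_epi32(2);
        let sum = _mm_add_epi32(a, b);
        // __m128i and [i32; 4] are both 16 bytes as well.
        std::mem::transmute::<__m128i, [i32; 4]>(sum)
    }
}
```

Note that the integer version also uses a different vector type, __m128i, rather than __m128.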

Initialising a SIMD vector

There are a few ways to initialise a SIMD vector. We’ll start by explicitly declaring the elements of a 128-bit floating-point vector.

let first_vector = _mm_set_ps(3.0, 2.0, 1.0, 0.0);
dbg!(first_vector); // __m128(0.0, 1.0, 2.0, 3.0,)

The first thing to notice is that the elements are declared in reverse order. Below is an extract from the Rust documentation:

pub unsafe fn _mm_set_ps(a: f32, b: f32, c: f32, d: f32) -> __m128
Note that a will be the highest 32 bits of the result, and d the lowest. This matches the standard way of writing bit patterns on x86:
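If the reversed argument order feels awkward, the standard library also exposes _mm_setr_ps, which takes its arguments lowest element first. The sketch below builds the same vector both ways and writes each one out to an array so the memory order is visible. set_vs_setr is a hypothetical name, and an x86_64 target is assumed.

```rust
use std::arch::x86_64::*;

// Hypothetical helper: builds the same vector with both constructors and
// returns the lanes in memory order (lowest element first).
fn set_vs_setr() -> ([f32; 4], [f32; 4]) {
    let mut high_first = [0.0f32; 4];
    let mut low_first = [0.0f32; 4];
    unsafe {
        // _mm_set_ps: first argument ends up in the highest 32 bits.
        _mm_storeu_ps(high_first.as_mut_ptr(), _mm_set_ps(3.0, 2.0, 1.0, 0.0));
        // _mm_setr_ps: first argument ends up in the lowest 32 bits.
        _mm_storeu_ps(low_first.as_mut_ptr(), _mm_setr_ps(0.0, 1.0, 2.0, 3.0));
    }
    (high_first, low_first)
}
```

Both arrays should come back as [0.0, 1.0, 2.0, 3.0].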

Next, we need to create another vector to carry out a SIMD operation. This time we’ll use another method of initialisation.

let increment = _mm_set1_ps(2.5);

Similar to the previous method of initialisation, we’ve created a SIMD vector with all 4 elements set to 2.5.

Next, let’s add the two vectors together using the add instruction.

let result = _mm_add_ps(first_vector, increment);
dbg!(result); // __m128(2.5, 3.5, 4.5, 5.5,)

Here we have an add operation on the type ps or f32 as noted earlier. After carrying out the add operation, we’re returned another SIMD vector. Although the arguments to _mm_set_ps were given in reverse, the output is ordered from the lowest element to the highest, as you would expect.

Reading from an array slice

In most cases, the array would come from another input source, so we need a method of reading the data and converting it into a SIMD vector.

let data = vec![-0.7, 3.2, 5.1, 0.5, 1.2, 30.0, -13.3, 9.0];

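Provided the length of the data is a multiple of 4 and the buffer starts on a 16-byte boundary, we can use the load instruction _mm_load_ps. This operation takes the first 128 bits (four f32 values) and loads them into a __m128 SIMD vector. Keep in mind that a Vec of f32 is only guaranteed to be 4-byte aligned, even though allocators commonly hand back 16-byte aligned memory.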

let first_block = _mm_load_ps(&data[0]);
dbg!(first_block); // __m128(-0.7, 3.2, 5.1, 0.5,)

If you wanted to start loading from the second element, you might try the following.

let next_block = _mm_load_ps(&data[1]);
// Causes a runtime error.

What you will most likely be presented with is a segmentation fault, which is one of the reasons we have to use unsafe Rust. Safe Rust would prevent this from happening, but by using unsafe we have opted out of this and other guards.

The reason for the segmentation fault lies in the requirements of the instruction we used. The load instruction expects a reference to data aligned on a 128-bit (16-byte) boundary. Assuming the start of the buffer is itself aligned, that translates into index 0 and 4, so in order to load from the second element you need the _mm_loadu_ps instruction, which has no alignment requirement.

let last_block = _mm_loadu_ps(&data[1]);
dbg!(last_block); // __m128(3.2, 5.1, 0.5, 1.2,)
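One way to take the guesswork out of alignment is to force it in the type. The following is a sketch under my own assumptions: Aligned is a hypothetical wrapper type that uses repr(align(16)), and load_aligned_and_unaligned is a hypothetical helper that performs one aligned and one unaligned load on it. An x86_64 target is assumed.

```rust
use std::arch::x86_64::*;

// Hypothetical wrapper type: forces the array to start on a 16-byte boundary.
#[repr(align(16))]
struct Aligned([f32; 8]);

// Hypothetical helper: returns the block at index 0 and the block at index 1.
fn load_aligned_and_unaligned() -> ([f32; 4], [f32; 4]) {
    let data = Aligned([-0.7, 3.2, 5.1, 0.5, 1.2, 30.0, -13.3, 9.0]);
    let mut first = [0.0f32; 4];
    let mut shifted = [0.0f32; 4];
    unsafe {
        // Index 0 sits on the 16-byte boundary, so the aligned load is fine.
        let first_block = _mm_load_ps(data.0.as_ptr());
        // Index 1 is 4 bytes past the boundary, so use the unaligned load.
        let next_block = _mm_loadu_ps(data.0.as_ptr().add(1));
        _mm_storeu_ps(first.as_mut_ptr(), first_block);
        _mm_storeu_ps(shifted.as_mut_ptr(), next_block);
    }
    (first, shifted)
}
```

If you cannot control the alignment of your input, using _mm_loadu_ps everywhere is the simpler choice.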

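To make sure every _mm_load_ps call starts on a 4-element boundary of the array, there’s the chunks function. Note that chunks hands back a shorter final chunk when the length is not a multiple of 4, which is why the data needs to be divisible by 4.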

data.chunks(4).for_each(|chunk| {
    let data_chunk = _mm_load_ps(&chunk[0]);
    // Do stuff
});
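When the length of the data is not guaranteed to be a multiple of 4, a variation on this loop can handle the leftovers separately. The sketch below uses chunks_exact_mut from the standard library, which only yields full chunks and exposes the remainder afterwards. add_to_all is a hypothetical helper name, and I have used the unaligned load and store since a Vec gives no 16-byte alignment guarantee. An x86_64 target is assumed.

```rust
use std::arch::x86_64::*;

// Hypothetical helper: adds `increment` to every element of `data`,
// four at a time where possible.
fn add_to_all(data: &mut [f32], increment: f32) {
    let mut chunks = data.chunks_exact_mut(4);
    for chunk in &mut chunks {
        // SAFETY: chunks_exact_mut only yields slices of exactly four
        // elements, and the unaligned load/store need no alignment.
        unsafe {
            let v = _mm_loadu_ps(chunk.as_ptr());
            let sum = _mm_add_ps(v, _mm_set1_ps(increment));
            _mm_storeu_ps(chunk.as_mut_ptr(), sum);
        }
    }
    // Fewer than four elements may be left over; handle them one by one.
    for value in chunks.into_remainder() {
        *value += increment;
    }
}
```

With a six-element input, the first four elements go through the SIMD path and the last two through the scalar loop.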

Bonus

To save you from the headache I experienced while trying to carry out a load on an array of i32s, below is an example.

let data = vec![-7, 3, 5, 0, 1, 30, -13, 9];
let simd = _mm_load_si128(&data[0] as *const _ as *const __m128i);

You can’t cast directly into an __m128i here, as required by the load function, so you have to first cast into a raw pointer and finally into the type.
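For completeness, here is a sketch of a full round trip with integers: load four i32 values, operate on them, and store the result back into an array. negate_first_four is a hypothetical helper, and I have used the unaligned _mm_loadu_si128 and _mm_storeu_si128 variants so that the slice does not need to be 16-byte aligned. The pointer cast method does the same job as the two as casts. An x86_64 target is assumed.

```rust
use std::arch::x86_64::*;

// Hypothetical helper: negates the first four i32 values of a slice.
fn negate_first_four(data: &[i32]) -> [i32; 4] {
    assert!(data.len() >= 4, "need at least four elements");
    let mut out = [0i32; 4];
    // SAFETY: the length check above guarantees 16 readable bytes, `out`
    // has room for 16 bytes, and the `u` variants accept any alignment.
    unsafe {
        // *const i32 -> *const __m128i, as required by the load function.
        let v = _mm_loadu_si128(data.as_ptr().cast::<__m128i>());
        // 0 - v negates every 32-bit lane.
        let negated = _mm_sub_epi32(_mm_setzero_si128(), v);
        _mm_storeu_si128(out.as_mut_ptr().cast::<__m128i>(), negated);
    }
    out
}
```

Passing in the data from the example above should return [7, -3, -5, 0].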

Converting into an array

Eventually, you’ll want to convert the __m128 type back into an array. Here are two possible solutions.

Transmuting

By passing the SIMD vector into the transmute function, we can easily convert the type into a standard array.

let data = _mm_set_ps(3., 2., 1., 0.);
let array = std::mem::transmute::<__m128, [f32; 4]>(data);
dbg!(array); // [0.0, 1.0, 2.0, 3.0]

This only works because the __m128 type is essentially four f32s in memory. If you’re unaware of the dangers of using this function, please check out the documentation.

Union

Another option, which is the equivalent of a transmute, uses the union type. You can imagine a union to be a struct whose fields carry out a transmute between the defined types. The Rust documentation lacks details regarding the union type but the Rust reference has some more detail.

union SimdToArray {
    array: [f32; 4],
    simd: __m128
}

let s2a = SimdToArray {
    simd: _mm_set_ps(3., 2., 1., 0.),
};

dbg!(s2a.array); // [0.0, 1.0, 2.0, 3.0]

The SIMD vector was stored into the simd field but later read from the array field.
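A third possibility, which avoids reinterpreting memory altogether, is to ask the CPU to write the vector out with the store instruction _mm_storeu_ps. The sketch below is my own addition rather than part of the original approach: round_trip is a hypothetical helper that converts the same vector with both a store and a transmute so the two can be compared. An x86_64 target is assumed.

```rust
use std::arch::x86_64::*;

// Hypothetical helper: loads four floats into a SIMD vector, then converts
// it back to an array twice, once with a store and once with a transmute.
fn round_trip(values: [f32; 4]) -> ([f32; 4], [f32; 4]) {
    let mut stored = [0.0f32; 4];
    unsafe {
        let v = _mm_loadu_ps(values.as_ptr());
        // Store: writes the four lanes to the destination, lowest first.
        _mm_storeu_ps(stored.as_mut_ptr(), v);
        // Transmute: reinterprets the same 16 bytes as an array.
        let transmuted = std::mem::transmute::<__m128, [f32; 4]>(v);
        (stored, transmuted)
    }
}
```

Both results should match the input array exactly.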

Masking

Since we only have a limited amount of space in a SIMD vector, you may find you need to swap values based on a condition. This can be done using the blend instruction if it’s available, but I found this to be slower than the following trick.

let array1 = _mm_set_ps(14., 5., 5., 4.);
//__m128(4., 5., 5., 14.)

let array2 = _mm_set_ps(6., 1., 7., 10.);
//__m128(10., 7., 1., 6.)

let low_mask = _mm_cmplt_ps(array1, array2);
// Lanes where the comparison holds are all 1 bits, the rest all 0 bits: think of it as (1, 1, 0, 0)

let result = _mm_or_ps(
    _mm_and_ps(array1, low_mask),
    // __m128(4., 5., 0., 0.)
    _mm_andnot_ps(low_mask, _mm_set1_ps(0.99))
    // __m128(0., 0., 0.99, 0.99),
);

dbg!(result); // __m128(4., 5., 0.99, 0.99)

We start by creating two vectors and using the cmplt (compare less than) instruction to generate our mask. The masking trick is then formed of three parts.

_mm_and_ps - Here we’re taking the bitwise AND of the mask and array1, which keeps the values where the mask is set and zeroes the rest.

_mm_andnot_ps - This flips the mask to (0., 0., 1., 1.) and then takes the bitwise AND with the second argument, so the values where the flipped mask is set become .99, or any other value we wanted.

_mm_or_ps - The first instruction returned (4., 5., 0., 0.) and the second (0., 0., 0.99, 0.99) so this or instruction produces (4., 5., 0.99, 0.99) as the result.
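The three steps above can be put together into a runnable sketch. select_or_default is a hypothetical helper name, and an x86_64 target is assumed. In addition to the result it returns the output of _mm_movemask_ps, which packs the top bit of each lane into an integer and is a handy way to inspect a mask, since a lane that is all 1 bits does not print as a sensible float.

```rust
use std::arch::x86_64::*;

// Hypothetical helper: keeps the lanes of array1 that are less than the
// matching lanes of array2 and replaces the others with 0.99.
fn select_or_default() -> ([f32; 4], i32) {
    let mut out = [0.0f32; 4];
    let mask_bits;
    unsafe {
        let array1 = _mm_set_ps(14., 5., 5., 4.); // lanes: 4, 5, 5, 14
        let array2 = _mm_set_ps(6., 1., 7., 10.); // lanes: 10, 7, 1, 6
        let low_mask = _mm_cmplt_ps(array1, array2);
        // Bit i of the result is the top bit of lane i of the mask.
        mask_bits = _mm_movemask_ps(low_mask);
        let result = _mm_or_ps(
            // Keep array1 where the mask is set.
            _mm_and_ps(array1, low_mask),
            // Use 0.99 where the mask is not set.
            _mm_andnot_ps(low_mask, _mm_set1_ps(0.99)),
        );
        _mm_storeu_ps(out.as_mut_ptr(), result);
    }
    (out, mask_bits)
}
```

The array comes back as [4.0, 5.0, 0.99, 0.99], and the mask bits as 0b0011, because only the two lowest lanes passed the comparison.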

Conclusion

Hopefully, that gives you enough to get started with SIMD. It can be tricky in Rust as there aren’t a lot of tutorials around, but with a bit of persistence you’ll soon be on your way. Good luck!