addition/subtraction acceleration via inline assembly #314

Open
wants to merge 3 commits into base: master
Conversation

@twiby commented Jul 8, 2024

During the addition/subtraction loops, there is performance to be gained by specifying the exact assembly instructions. The performance gain starts to be felt above 1,000-bit integers and is maximal around 10,000 bits, where we get close to 2x performance.

Documentation for assembly instructions can be found here

Added assembly

Principle

The main idea is to avoid any instruction that modifies the carry flag, except for adc and sbb of course, so that the carry is propagated through the whole loop. This is doable if we use a decreasing index for the loop counter and a jnz instruction. The indices we use have to be updated with inc and dec instructions (which do not modify the carry flag).
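
For illustration, here is a minimal one-digit-per-iteration sketch of that loop shape (hypothetical names; the code in this PR batches digits and picks registers differently):

use core::arch::asm;

// Minimal sketch of the carry-preserving loop, one digit per iteration;
// the actual PR unrolls this 5 digits at a time.
#[cfg(target_arch = "x86_64")]
unsafe fn add_assign_sketch(lhs: &mut [u64], rhs: &[u64]) -> bool {
    let len = lhs.len().min(rhs.len());
    if len == 0 {
        return false;
    }
    let carry: u8;
    asm!(
        "clc",                                   // start with a clear carry flag
        "2:",
        "mov {tmp}, qword ptr [{rhs} + {i}*8]",  // load one rhs digit
        "adc qword ptr [{lhs} + {i}*8], {tmp}",  // lhs[i] += rhs[i] + CF
        "inc {i}",                               // inc/dec leave CF untouched
        "dec {n}",
        "jnz 2b",                                // loop without clobbering CF
        "setc {c}",                              // capture the final carry
        lhs = in(reg) lhs.as_mut_ptr(),
        rhs = in(reg) rhs.as_ptr(),
        i = inout(reg) 0usize => _,
        n = inout(reg) len => _,
        tmp = out(reg) _,
        c = out(reg_byte) carry,
        options(nostack),
    );
    carry != 0
}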

Batching

To avoid the overhead of looping over digits one by one, we fill the 16 registers as much as possible. This lets us treat digits 5 by 5 in one step (enabling efficient CPU pipelining).

A consequence of this is that some digits (strictly fewer than 5) are still not added after the inline assembly has run, so we keep the existing code to handle the remaining digits, roughly as sketched below.
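
A rough sketch of that tail handling (assumed shapes and names, not the crate's exact code):

// Hypothetical tail handler: after the asm loop has added `done` digits and
// produced `carry`, finish the (fewer than 5) remaining digits with plain
// carrying adds, similar in spirit to the existing scalar path.
fn finish_tail(lhs: &mut [u64], rhs: &[u64], done: usize, mut carry: bool) -> bool {
    for (a, b) in lhs[done..].iter_mut().zip(&rhs[done..]) {
        let (s, c1) = a.overflowing_add(*b);
        let (s, c2) = s.overflowing_add(carry as u64);
        *a = s;
        carry = c1 || c2;
    }
    carry
}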

Safety

Obviously this is very much unsafe. On my personal project, where I want to give people the option to rely only on safe Rust, I put these functions behind a feature gate. If necessary, I can take a similar approach here.

Performance

I've measured performance using the criterion crate, adding 2 BigUint values of different sizes. The results are below. On some scales, this gives close to a 2x gain! To me the performance gain is clear enough to be unambiguous, though I'm not a benchmarking expert, and I'm sure a finer evaluation could be interesting.

[Figure 1: benchmark results]

Style

For easy review, the 2 added functions are in the addition and subtraction files, but I'd argue that in the end it'd be better to put them in a dedicated sub-module; they look far too different from the surrounding code.

Also, I'm notoriously average at naming things, so suggestions are welcome ^^

@cuviper (Member) commented Jul 8, 2024

Can you share your benchmark code somewhere? I'd like to compare results across various Intel and AMD machines, and also against -Ctarget-cpu=native. I'm not sure how much the CPU will matter when we're already using adc/sbb intrinsics, but it's worth checking.

In general, I would rather see Rust-native changes that coax LLVM toward the codegen we want, if that's possible, because that's more likely to play well with target-cpu and even other architectures.

Also, I'm pretty sure this needs more cfg to avoid breaking x32 -- see the recent #311 and #312.

@twiby (Author) commented Jul 9, 2024

native rust concern

I've tried to play around in godbolt, but I could not generate the assembly I want, which is quite brittle: as soon as any instruction touches the carry flag (add or sub for the loop indices, or even cmp), you have to add instructions to store the carry flag and then conditionally set it back before the adc calls (a true 50% branch condition). I'll try again to see if I have more luck.

x32

Thanks, I did not know target_arch = "x86_64" wasn't enough! I don't have a 32-bit machine, but I'll try to fix those anyway. Also, this whole trick could be applied to 32-bit machines as well, with almost exactly the same assembly, but without a 32-bit machine to test on, I won't risk it.

benchmark code

My benchmark code is below; it is not particularly fancy:

use num_bigint::BigUint;

use rand::distributions::Standard;
use rand::prelude::Distribution;
use rand::{thread_rng, Rng};

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn gen_n_random_values<T>(n: usize) -> Vec<T>
where
    Standard: Distribution<T>,
{
    let mut ret = Vec::<T>::with_capacity(n);
    for _ in 0..n {
        ret.push(thread_rng().gen::<T>());
    }
    ret
}

pub fn add<const N: usize>(c: &mut Criterion) {
    let mut name = "num-bigint add ".to_string();
    name.push_str(&N.to_string());

    println!("nb digits: {:?}", (N / 32) * 32);
    let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 + &n2)));
}

criterion_group!(
    num_bigint_add,
    add<10>,
    add<30>,
    add<100>,
    add<300>,
    add<1_000>,
    add<3_000>,
    add<10_000>,
    add<30_000>,
    add<100_000>,
    add<300_000>,
    add<1_000_000>,
    add<3_000_000>,
    add<10_000_000>,
);

pub fn sub<const N: usize>(c: &mut Criterion) {
    let mut name = "num-bigint sub ".to_string();
    name.push_str(&N.to_string());

    let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32 + 1));
    let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 - &n2)));
}

criterion_group!(
    biguint_sub,
    sub<1_000>,
    sub<3_000>,
    sub<10_000>,
    sub<30_000>,
    sub<100_000>,
    sub<300_000>,
    sub<1_000_000>,
    sub<3_000_000>,
    sub<10_000_000>,
);

pub fn mul<const N: usize>(c: &mut Criterion) {
    let mut name = "num-bigint mul ".to_string();
    name.push_str(&N.to_string());

    println!("nb digits: {:?}", (N / 32) * 32);
    let n1 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    let n2 = BigUint::new(gen_n_random_values::<u32>(N / 32));
    c.bench_function(name.as_str(), |b| b.iter(|| black_box(&n1 * &n2)));
}

criterion_group!(
    biguint_mul,
    mul<10>,
    mul<30>,
    mul<100>,
    mul<300>,
    mul<1_000>,
    mul<3_000>,
    mul<10_000>,
    mul<30_000>,
);

criterion_main!(num_bigint_add, biguint_sub, biguint_mul);

@twiby (Author) commented Jul 9, 2024

I've pushed the fix for 32-bit pointer width targets.
I was able to cross-compile for the target i686-pc-windows-msvc, but could not run the tests.
I was able to compile and test on my Raspberry Pi (ARM target).

@cuviper (Member) commented Jul 10, 2024

By x32, I mean the odd target x86_64-unknown-linux-gnux32, which uses a 64-bit CPU with 32-bit pointers. In Rust that looks like cfg(all(target_arch = "x86_64", target_pointer_width = "32")). See https://en.wikipedia.org/wiki/X32_ABI
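
Roughly, a gate along these lines would exclude that target (module name is just illustrative):

// Illustrative gate: require 64-bit pointers as well as the x86_64 arch, so the
// x32 target (64-bit CPU, 32-bit pointers) keeps the portable code path.
#[cfg(all(target_arch = "x86_64", target_pointer_width = "64"))]
mod x86_64_asm_helpers {
    // the inline-asm addition/subtraction helpers would live only behind this gate
}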

Thanks for your benchmark code - I'll take a closer look later.

@twiby (Author) commented Jul 10, 2024

Thank you for the clarification! It did compile fine for x86_64-unknown-linux-gnux32, but with warnings. I've pushed changes to remove the warnings.

Thank you for taking the time to look at this!

@twiby (Author) commented Jul 17, 2024

Have you had the time to look at the benchmarks? I'm very curious about the behavior on other machines!

@twiby (Author) commented Jul 20, 2024

For reference, this equivalent experimental implementation (without the asm macro) produces much shorter assembly, but it is less amenable to pipelining, and its performance is very similar to the current master branch. I could not make it more efficient, for example by trying different batch sizes.

All in all, the current PR still has by far the best performance of the approaches I've tried.

#[cfg(target_arch = "x86_64")]
cfg_64!(
    /// Performs a part of the addition. Returns a tuple containing the carry state
    /// and the number of integers that were added
    ///
    /// By using as many registers as possible, we treat digits 5 by 5
    unsafe fn schoolbook_add_assign_x86_64(
        lhs: &mut [u64],
        rhs: &[u64],
        mut size: usize,
    ) -> (bool, usize) {
        if size < 5 {
            return (false, 0);
        }
        size -= 5;

        let mut c = 0u8;
        let mut idx = 0;

        while idx < size {
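            // Manually unrolled 5 times; each _addcarry_u64 consumes the previous
            // carry and produces the next, so c always holds the running carry bit.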
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
            c = unsafe {
                arch::_addcarry_u64(
                    c,
                    *lhs.get_unchecked(idx),
                    *rhs.get_unchecked(idx),
                    lhs.get_unchecked_mut(idx),
                )
            };
            idx += 1;
        }

        (c > 0, idx)
    }
);

Assembly generated:

schoolbook_add_assign_x86_64:
        cmp     r8, 5
        jb      .LBB1_1
        add     r8, -5
        je      .LBB1_1
        mov     rax, rdx
        mov     edx, 4
        xor     ecx, ecx
.LBB1_4:
        mov     rsi, rdx
        mov     rdx, qword ptr [rax + 8*rdx - 32]
        add     cl, -1
        adc     qword ptr [rdi + 8*rsi - 32], rdx
        mov     rcx, qword ptr [rax + 8*rsi - 24]
        adc     qword ptr [rdi + 8*rsi - 24], rcx
        mov     rcx, qword ptr [rax + 8*rsi - 16]
        adc     qword ptr [rdi + 8*rsi - 16], rcx
        mov     rcx, qword ptr [rax + 8*rsi - 8]
        adc     qword ptr [rdi + 8*rsi - 8], rcx
        mov     rcx, qword ptr [rax + 8*rsi]
        adc     qword ptr [rdi + 8*rsi], rcx
        setb    cl
        lea     rdx, [rsi + 5]
        inc     rsi
        cmp     rsi, r8
        jb      .LBB1_4
        test    cl, cl
        setne   al
        add     rdx, -4
        ret
.LBB1_1:
        xor     edx, edx
        xor     eax, eax
        ret

In particular, I cannot make the assembly use more registers in a single batch, so the additions are not pipelined together.

On the other hand, I think I was wrong in thinking that overwriting the CF flag is a branching issue: it is reset via the add cl, -1 instruction (still a bit expensive; when cl is 1 the addition wraps and sets CF, when cl is 0 it does not). The penalty is not nearly as big as a true branch.
