SVD register clusters generate inefficient machine code #46

Open
jnkr-ifx opened this issue Nov 28, 2024 · 2 comments
Labels: enhancement (New feature or request)
Milestone: Release 0.4.0

@jnkr-ifx (Contributor)

Is your feature request related to a problem? Please describe.

svd2pac generates accessor functions for register clusters that return an array of all the registers within the cluster:

impl Gpio {
    #[inline(always)]
    pub fn prt(self) -> [Prt; 15] {
        unsafe {
            [
                Prt {
                    ptr: self.ptr.add(0x0usize + 0x0usize),
                },
                Prt {
                    ptr: self.ptr.add(0x0usize + 0x80usize),
                },
                Prt {
                    ptr: self.ptr.add(0x0usize + 0x100usize),
                },
                Prt {
                    ptr: self.ptr.add(0x0usize + 0x180usize),
                },
                // ...
            ]
        }
    }
}

Now consider a function that accesses a register:

fn write_to_port(port: u8) {
    unsafe { GPIO.prt()[port as usize].write(42); }
}

If this function is inlined and the compiler can statically determine which index in the port cluster will be accessed, it can optimize out the array and issue a write directly to the appropriate port register. But if the compiler decides not to inline the function for code-size reasons, or there is some other reason it does not statically know which port will be accessed, then the generated code is extremely poor: the program constructs the entire array, one element at a time, just so it can index into it to find the register to write: https://rust.godbolt.org/z/Pa31oW4PT
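For contrast, a caller with a statically known index compiles down to a single store (a minimal sketch; write_to_port_3 is a hypothetical caller):

fn write_to_port_3() {
    // With a constant index, the optimizer folds the array construction
    // away and emits a single store at offset 3 * 0x80 from the GPIO base.
    unsafe { GPIO.prt()[3].write(42); }
}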

The SVD file I'm working with has some register clusters that can have up to 1023 registers, and a single write to one of these registers generates 7.8kB of code!

Describe the solution you'd like
I would like to be able to write the write_to_port function in a way that generates simple pointer arithmetic instead of creating the entire array only to access a single element.

I experimented with constructing the array in different ways, but it seems LLVM is not able to optimize out the array entirely when the index of the element to access is not statically known. Thus, it's not possible to ensure efficient code generation without changing the function signature to return something other than an array.

I think the simplest thing to do is to change the accessor function (or add a new function) to take the index as a parameter: https://rust.godbolt.org/z/c3ePWTxYq

pub fn prt(self, i: usize) -> Prt {
    assert!(i < 15, "index out of range");
    unsafe {
        Prt {
            ptr: self.ptr.add(0x0usize + i * 0x80usize),
        }
    }
}

fn write_to_port(port: u8) {
    unsafe { GPIO.prt(port as usize).write(42); }
}

This has the disadvantage of not exposing the length and iterator methods of an array. This is not necessary (or even desirable) for PSOC, because our SVD files define clusters with the maximum possible number of registers that the MMIO interface supports, rather than the actual number that exist on the hardware. But if that's useful functionality for other applications, it could be implemented by returning an accessor object that provides these methods: https://rust.godbolt.org/z/4496oxqoh

impl Gpio {
    #[inline(always)]
    pub fn prt(self) -> Cluster<Prt, 15, 0x80> {
        unsafe { Cluster::new(self.ptr) }
    }
}

#[derive(Copy, Clone)]
#[allow(private_bounds)]
pub struct Cluster<T: FromPtr, const N: usize, const DIM: usize> {
    ptr: *mut u8,
    _t: ::core::marker::PhantomData<T>
}

#[allow(private_bounds)]
impl<T: FromPtr, const N: usize, const DIM: usize> Cluster<T, N, DIM> {
    pub fn get(&self, i: usize) -> T {
        assert!(i < N, "index out of range");
        unsafe { T::from_ptr(self.ptr.add(i * DIM)) }
    }

    pub fn len(&self) -> usize {
        N
    }

    pub fn iter(&self) -> impl ExactSizeIterator<Item=T> {
        let s = *self;
        (0..N).map(move |i| s.get(i))
    }

    unsafe fn new(ptr: *mut u8) -> Self {
        Cluster { 
            ptr, 
            _t: ::core::marker::PhantomData
        }
    }
}

trait FromPtr: Copy + Clone {
    fn from_ptr(ptr: *mut u8) -> Self;
}
impl FromPtr for Prt {
    fn from_ptr(ptr: *mut u8) -> Self {
        Prt { ptr }
    }
}
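For illustration, usage under this design might look like the following (a sketch; GPIO and write are the same hypothetical accessors as in the earlier examples):

fn write_to_port(port: u8) {
    // Bounds-checked pointer arithmetic; no array is materialized.
    unsafe { GPIO.prt().get(port as usize).write(42); }
}

fn clear_all_ports() {
    // iter() computes each register address lazily.
    for prt in GPIO.prt().iter() {
        unsafe { prt.write(0); }
    }
}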

I'm happy to send a PR if either of these options is acceptable.

Describe alternatives you've considered
The code size bloat could be mitigated somewhat by using core::array::from_fn to initialize the array: https://rust.godbolt.org/z/f6vzW9Yoa

This generates a loop to initialize the register array, instead of forcing it to be unrolled. But it's far from optimal as it still creates the entire array, computing the address of every register in the cluster instead of just the one that's needed.
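A sketch of that variant, assuming the same field layout as the generated code above:

impl Gpio {
    #[inline(always)]
    pub fn prt(self) -> [Prt; 15] {
        // The array is built in a loop the optimizer may keep rolled,
        // but all 15 addresses are still computed up front.
        ::core::array::from_fn(|i| Prt {
            ptr: unsafe { self.ptr.add(i * 0x80) },
        })
    }
}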

@jnkr-ifx (Contributor, Author) commented Dec 3, 2024

I spent some time today prototyping the idea of a cluster type that allows normal array indexing, and I wasn't able to come to a solution I felt satisfied with. Using a zero-sized type so that a reference can be returned from the Index trait is promising, but it still has the problem that some sort of getter method is needed to turn the returned reference into a normal register type. So instead of something like GPIO.prt().get(i).out().write(...) as in my original proposal, you have cluster[i].get().out().write(...), which doesn't strike me as much better (there isn't anything you can do with cluster[i] besides calling get() to turn it into a Prt).
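To make the discussion concrete, that kind of prototype might look roughly like this (a sketch with hypothetical names; Prt and the 15/0x80 dimensions are from the earlier examples):

#[derive(Copy, Clone)]
pub struct Prt {
    ptr: *mut u8,
}

// Zero-sized marker: the *address* of a &PrtRef encodes the register
// location. References to zero-sized types carry no dereferenceable
// requirement, so forming one at an MMIO address is sound.
pub struct PrtRef(());

pub struct PrtCluster {
    ptr: *mut u8,
}

impl core::ops::Index<usize> for PrtCluster {
    type Output = PrtRef;

    fn index(&self, i: usize) -> &PrtRef {
        assert!(i < 15, "index out of range");
        unsafe { &*(self.ptr.add(i * 0x80) as *const PrtRef) }
    }
}

impl PrtRef {
    // The extra getter step that cluster[i] cannot avoid.
    pub fn get(&self) -> Prt {
        Prt {
            ptr: self as *const PrtRef as *mut u8,
        }
    }
}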

I was hoping we could implement Deref on the reference returned by the indexing operation, so that you could do something like cluster[i].out().write(...) as with the original array type. However, that does not work because Deref::deref is expected to return an & reference, which leaves us with the same problem we had with Index::index.

One solution would be to represent all register types as a reference to a zero-sized type rather than a struct with a pointer. It would even be a completely source-compatible change, and it would look something like this:

pub struct _Gpio;
pub type Gpio = &'static _Gpio;
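For illustration, the accessors would then derive addresses from the reference itself (a sketch; the pointer casts are the assumed mechanism, and prt/Prt are as in the earlier examples):

impl _Gpio {
    #[inline(always)]
    pub fn prt(&self, i: usize) -> Prt {
        assert!(i < 15, "index out of range");
        unsafe {
            Prt {
                // &_Gpio is zero-sized, so the reference itself is just
                // the peripheral's base address.
                ptr: (self as *const _Gpio as *mut u8).add(i * 0x80),
            }
        }
    }
}

An instance would be created once from the peripheral's base address, e.g. unsafe { &*(base as *const _Gpio) }.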

However, the big disadvantage to this is that it makes documentation much less clear: the Gpio functions show up on this new _Gpio struct in rustdoc.

I experimented with different ways of representing indexing using slices, and with borrowing other ideas from the bitvec crate, but unfortunately the trick it uses to pull off indexing doesn't work here:

impl<T, O> Index<usize> for BitSlice<T, O>
where
	T: BitStore,
	O: BitOrder,
{
	type Output = bool;

	fn index(&self, index: usize) -> &Self::Output {
		match *index.index(self) {
			true => &true,
			false => &false,
		}
	}
}

They're working with bits that have only two possible states, so they can store both states as constants in static memory and return references to them, whereas indexing a register cluster can return thousands of possible values.

Any approach using slices to represent the strides/sizes of registers is likely unsound, because Rust currently requires any reference to a non-zero-sized type to be dereferenceable at all times (i.e. the compiler may insert spurious dereferences); thus, it is unsound to create a reference to a non-zero-sized type with an address in the MMIO register space. (See rust-lang/unsafe-code-guidelines#411 for more information.)

So, I feel like the best way forward is the second option I originally proposed (cluster.get(i)), as it lets us provide functionality like a len function, IntoIterator implementations, etc. on the cluster type. But any of the options discussed are OK with us in ICW.
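For completeness, IntoIterator can be implemented on the Cluster type from the first post without ever materializing an array (a sketch, reusing the Cluster and FromPtr definitions above):

#[allow(private_bounds)]
pub struct ClusterIter<T: FromPtr, const N: usize, const DIM: usize> {
    cluster: Cluster<T, N, DIM>,
    i: usize,
}

#[allow(private_bounds)]
impl<T: FromPtr, const N: usize, const DIM: usize> Iterator for ClusterIter<T, N, DIM> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        if self.i < N {
            // Each register address is computed on demand.
            let item = self.cluster.get(self.i);
            self.i += 1;
            Some(item)
        } else {
            None
        }
    }
}

#[allow(private_bounds)]
impl<T: FromPtr, const N: usize, const DIM: usize> IntoIterator for Cluster<T, N, DIM> {
    type Item = T;
    type IntoIter = ClusterIter<T, N, DIM>;

    fn into_iter(self) -> Self::IntoIter {
        ClusterIter { cluster: self, i: 0 }
    }
}

With this, for prt in GPIO.prt() { ... } iterates by recomputing each address rather than building the array.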

pellico added this to the Release 0.4.0 milestone · Dec 3, 2024
@pellico (Contributor) commented Dec 6, 2024

> So instead of something like GPIO.prt().get(i).out().write(...) as with my original proposal, you have cluster[i].get().out().write(...), which doesn't strike me as much better (there's not anything you could do with cluster[i] besides just calling get() to turn it into a Prt).

My proposal is exactly that:

cluster[i].register_name().read() to read the register.

The register API would be implemented on the reference returned by indexing into the cluster.

> One solution would be to represent all register types as a reference to a zero-sized type rather than a struct with a pointer. It would even be a completely source-compatible change, and it would look something like this:
>
> pub struct _Gpio;
> pub type Gpio = &'static _Gpio;
>
> However, the big disadvantage to this is that it makes documentation much less clear: the Gpio functions show up on this new _Gpio struct in rustdoc.

I don't find the type too complex compared to other solutions such as svd2rust. If this is the only drawback, I still think this is presently the best solution.
However, I think it is more a matter of taste.

@andreasWallnerIFX what do you think?

If we are not able to come to an agreement, maybe we can prototype both solutions to check whether there are other technical issues, and then present the prototypes to more users.

pellico added the enhancement (New feature or request) label · Dec 6, 2024