
Tracking Issue: Extension DTypes, Scalars, and Arrays #6547

@a10y


Discussed in #6500

Originally posted by connortsui20 February 13, 2026

More Robust Extension Data Types In Vortex

We would like to build a more robust system for extension data types (or DTypes).

#6081 introduced vtables for extension DTypes. Each extension type (e.g. Timestamp) now implements ExtDTypeVTable, which handles validation, serialization, and metadata. The type-erased ExtDTypeRef carries this vtable with it inside DType::Extension.

The natural next steps would be to add analogous vtables for vortex-scalar (e.g., custom Display and casting) and vortex-array (custom compute kernels). This would give us three traits:

ExtDTypeVTable   (vortex-dtype)
ExtScalarVTable  (vortex-scalar)
ExtArrayVTable   (vortex-array)

Issues

There is a problem with this approach, and it has come up a few times during the effort to add the scalar extension vtable to scalar extension values: #6477.

Once an ExtDType is type-erased to ExtDTypeRef, the only thing it carries is the dtype vtable (ONLY the dtype, not the scalar or array vtables). Suppose you have an ExtensionArray and want to call scalar or array logic: you then need to look up the other vtables by ExtID in a session registry. This means threading &VortexSession through every code path that touches extension types (compute kernels, builders, canonicalization, display, etc.).

This is kind of torturous because the dtype vtable is literally right there inside the DType, but the scalar/array vtables require a registry lookup to find. If the vtables were combined we would not have this issue. As long as they stay separate, sessions need to be plumbed through APIs that otherwise have no reason to take one (in other words, many constructors would need to take a session if they could potentially create an extension type).
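The lookup problem can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in (`DTypeVTable`, `ScalarVTable`, `Session`, the string extension IDs, and the `i64` storage value), not the real Vortex API; the point is only that `display_scalar` must take a session even though it already holds the dtype.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-ins; none of these are the real Vortex traits or types.
pub trait DTypeVTable: Send + Sync {
    fn id(&self) -> &'static str;
}

pub trait ScalarVTable: Send + Sync {
    fn display(&self, storage: i64) -> String;
}

/// Type-erased dtype handle: it carries ONLY the dtype vtable.
#[derive(Clone)]
pub struct ExtDTypeRef {
    vtable: Arc<dyn DTypeVTable>,
}

impl ExtDTypeRef {
    pub fn new(vtable: Arc<dyn DTypeVTable>) -> Self {
        Self { vtable }
    }
}

/// Stand-in for a session registry keyed by extension ID.
#[derive(Default)]
pub struct Session {
    scalar_vtables: HashMap<&'static str, Arc<dyn ScalarVTable>>,
}

impl Session {
    pub fn register_scalar(&mut self, id: &'static str, vtable: Arc<dyn ScalarVTable>) {
        self.scalar_vtables.insert(id, vtable);
    }
}

/// Example extension type implementing both layers separately.
pub struct Timestamp;

impl DTypeVTable for Timestamp {
    fn id(&self) -> &'static str {
        "example.timestamp"
    }
}

impl ScalarVTable for Timestamp {
    fn display(&self, storage: i64) -> String {
        format!("{storage}ms")
    }
}

/// The dtype is in hand, but scalar logic still needs a session lookup by ID.
/// Returns `None` when the session has no scalar vtable for this extension.
pub fn display_scalar(session: &Session, dtype: &ExtDTypeRef, storage: i64) -> Option<String> {
    let scalar_vtable = session.scalar_vtables.get(dtype.vtable.id())?;
    Some(scalar_vtable.display(storage))
}
```

Every caller of `display_scalar` now has to obtain a `&Session` from somewhere, which is the plumbing cost described above.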

What we probably want is a single ExtVTable per extension dtype that covers all three layers, so that when you have an ExtDTypeRef you already have everything you need.

Crate Dependency Graph!

The crate dependency graph is:

vortex-array --(depends on)--> vortex-scalar --(depends on)--> vortex-dtype

A unified vtable trait would need to reference types from all three crates. That is impossible as long as the trait lives in vortex-dtype, because vortex-dtype cannot depend on vortex-scalar or vortex-array without creating a dependency cycle.


Potential Solutions

Here are some potential solutions, some uglier than others...

Make the VortexSession a global static

This is not great for hygiene, but it would mean that everything can access the session and look up vtables without having to pass VortexSession around everywhere.

Of course, whether this is worth adding a global execution context is up for debate.
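As a rough sketch of what the global-static option could look like, the following uses `std::sync::OnceLock` from the standard library. The `Session`, `ScalarVTable`, and `Millis` types, the string IDs, and the `i64` storage value are all hypothetical stand-ins rather than the real VortexSession API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

// Hypothetical stand-in for a scalar-layer vtable.
pub trait ScalarVTable: Send + Sync {
    fn display(&self, storage: i64) -> String;
}

/// Hypothetical session holding a registry of scalar vtables by extension ID.
#[derive(Default)]
pub struct Session {
    scalar_vtables: RwLock<HashMap<&'static str, Arc<dyn ScalarVTable>>>,
}

impl Session {
    pub fn register_scalar(&self, id: &'static str, vtable: Arc<dyn ScalarVTable>) {
        self.scalar_vtables.write().unwrap().insert(id, vtable);
    }

    pub fn scalar_vtable(&self, id: &str) -> Option<Arc<dyn ScalarVTable>> {
        self.scalar_vtables.read().unwrap().get(id).cloned()
    }
}

// Process-wide session, created lazily on first access.
static GLOBAL_SESSION: OnceLock<Session> = OnceLock::new();

pub fn global_session() -> &'static Session {
    GLOBAL_SESSION.get_or_init(Session::default)
}

/// Example extension scalar logic.
pub struct Millis;

impl ScalarVTable for Millis {
    fn display(&self, storage: i64) -> String {
        format!("{storage}ms")
    }
}

/// No session parameter: the lookup goes through the global instead.
pub fn display_scalar(id: &str, storage: i64) -> Option<String> {
    let vtable = global_session().scalar_vtable(id)?;
    Some(vtable.display(storage))
}
```

The signature of `display_scalar` no longer mentions a session, but the dependency is now hidden: behavior depends on what has been registered globally, which is the hygiene concern raised above.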

Merge the Crates

In my opinion, this is a better solution than the above.

If vortex-dtype, vortex-scalar, and vortex-array were a single crate (or at least the extension vtable machinery lived in one place that could see all three), we could define:

pub trait ExtVTable: 'static + Send + Sync + ... {
    type Metadata: ...;

    // `DType`

    fn id(&self) -> ExtID;
    fn validate(&self, metadata: &Self::Metadata, storage: &DType) -> VortexResult<()>;
    fn serialize(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;
    fn deserialize(&self, data: &[u8]) -> VortexResult<Self::Metadata>;

    // `Scalar`

    // (This is not how it actually would look, but close enough)
    fn display(&self, metadata: &Self::Metadata, value: &ScalarValue, f: &mut fmt::Formatter) -> fmt::Result { ... }
    fn cast(&self, ...) -> VortexResult<Scalar> { ... }

    // `ArrayRef`

    fn cast_array(&self, ...) -> VortexResult<ArrayRef> { ... }
    // <-- Probably a lot more than this -->
}
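To make the shape above more concrete, here is a heavily simplified, self-contained sketch with stand-in types. `DType`, `ScalarValue`, and `ArrayRef` are toy placeholders, the `Metadata` associated type and serialization are omitted, and the `max` method is a hypothetical array-layer operation chosen only for illustration. It shows that once a single vtable travels with ExtDTypeRef, scalar and array logic need no registry lookup.

```rust
use std::fmt;
use std::sync::Arc;

// Toy placeholders for the real Vortex types (assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum DType {
    I64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ScalarValue(pub i64);

pub type ArrayRef = Arc<Vec<i64>>;

/// One vtable covering all three layers.
pub trait ExtVTable: 'static + Send + Sync {
    // dtype layer
    fn id(&self) -> &'static str;
    fn validate(&self, storage: &DType) -> Result<(), String>;
    // scalar layer
    fn display(&self, value: &ScalarValue, f: &mut fmt::Formatter<'_>) -> fmt::Result;
    // array layer (hypothetical operation)
    fn max(&self, array: &ArrayRef) -> Option<ScalarValue>;
}

/// Type-erased handle that carries the whole vtable with it.
#[derive(Clone)]
pub struct ExtDTypeRef(Arc<dyn ExtVTable>);

impl ExtDTypeRef {
    pub fn new(vtable: Arc<dyn ExtVTable>) -> Self {
        Self(vtable)
    }

    pub fn validate(&self, storage: &DType) -> Result<(), String> {
        self.0.validate(storage)
    }

    pub fn scalar(&self, value: ScalarValue) -> ExtScalar {
        ExtScalar { dtype: self.clone(), value }
    }

    /// Array logic is reachable directly from the dtype handle.
    pub fn max(&self, array: &ArrayRef) -> Option<ExtScalar> {
        self.0.max(array).map(|value| self.scalar(value))
    }
}

pub struct ExtScalar {
    dtype: ExtDTypeRef,
    value: ScalarValue,
}

impl fmt::Display for ExtScalar {
    // Scalar display dispatches through the dtype's vtable; no session needed.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.dtype.0.display(&self.value, f)
    }
}

pub struct Timestamp;

impl ExtVTable for Timestamp {
    fn id(&self) -> &'static str {
        "example.timestamp"
    }

    fn validate(&self, storage: &DType) -> Result<(), String> {
        match storage {
            DType::I64 => Ok(()),
            other => Err(format!("timestamp storage must be I64, got {other:?}")),
        }
    }

    fn display(&self, value: &ScalarValue, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}ms", value.0)
    }

    fn max(&self, array: &ArrayRef) -> Option<ScalarValue> {
        array.iter().copied().max().map(ScalarValue)
    }
}
```

In this sketch all three layers live in one file, which mirrors the requirement that the trait be defined somewhere that can see dtype, scalar, and array types at once.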

ScalarValues should store ArrayRef instead of Vec<Option<ScalarValue>>

Another thing that I have yet to mention is that we probably want to have ScalarValues that can hold an ArrayRef directly. Right now, scalar lists are stored as Vec<Option<ScalarValue>>, which is extremely heavyweight. You can imagine for an extension type like a Tensor that scalars would instantly become a bottleneck for any compute operations like matrix multiplication.

This is impossible with the current crate structure as vortex-array depends on vortex-scalar, so we cannot store arrays inside scalars.

Arguably, the fact that we cannot do this is the only reason that our scalars are not performant. The list variant is the only one that currently makes an owned allocation on creation (as opposed to shared allocations like ByteBuffer).
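The difference between an owned list of scalar values and a shared array handle can be sketched as follows. `OwnedScalar` and `SharedScalar` are hypothetical stand-ins, and `Arc<[i64]>` plays the role that ArrayRef would play; this is not the actual ScalarValue definition.

```rust
use std::sync::Arc;

/// Stand-in for the current layout: each list owns a Vec of nested values,
/// so cloning a list scalar deep-copies every element.
#[derive(Debug, Clone, PartialEq)]
pub enum OwnedScalar {
    Primitive(i64),
    List(Vec<Option<OwnedScalar>>),
}

/// Stand-in for the proposed layout: the list holds a shared, reference-counted
/// buffer (playing the role of ArrayRef), so cloning only bumps a refcount.
#[derive(Debug, Clone)]
pub enum SharedScalar {
    Primitive(i64),
    List(Arc<[i64]>),
}

/// True when both scalars are lists backed by the same allocation.
pub fn shares_buffer(a: &SharedScalar, b: &SharedScalar) -> bool {
    match (a, b) {
        (SharedScalar::List(x), SharedScalar::List(y)) => Arc::ptr_eq(x, y),
        _ => false,
    }
}

/// Number of elements in a shared list scalar, if it is a list.
pub fn shared_len(scalar: &SharedScalar) -> Option<usize> {
    match scalar {
        SharedScalar::List(values) => Some(values.len()),
        SharedScalar::Primitive(_) => None,
    }
}
```

Cloning a `SharedScalar::List` leaves both copies pointing at one buffer, whereas cloning an `OwnedScalar::List` allocates a new `Vec` and copies each element.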

Open Questions

  • Are there other approaches that avoid merging the crates or having global static variables? We haven't been able to think of any.
  • Is the crate split between dtype/scalar/array load-bearing for compile times or other reasons (I strongly doubt this)?
  • Are there extension-array operations that shouldn't be bundled into the vtable?
  • Is this overkill or underkill?
  • It might be a good idea to figure out what exactly we want from extension types. The extension types we know we want are tensors and UUID, but we should work out what kinds of APIs they need and what a clean interface with Vortex might look like.

CC @gatesn
