Tracking Issue: Extension DTypes, Scalars, and Arrays


### Discussed in https://github.com/vortex-data/vortex/discussions/6500

<div type='discussions-op-text'>

<sup>Originally posted by **connortsui20** February 13, 2026</sup>
# More Robust Extension Data Types In Vortex

We would like to build a more robust system for extension data types (or `DType`s).

https://github.com/vortex-data/vortex/pull/6081 introduced vtables for extension DTypes. Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`.

The natural next steps would be to add analogous vtables for `vortex-scalar` (e.g., custom `Display` and casting) and `vortex-array` (custom compute kernels). This would give us three traits:

```
ExtDTypeVTable   (vortex-dtype)
ExtScalarVTable  (vortex-scalar)
ExtArrayVTable   (vortex-array)
```

## Issues

There is a problem with this design. It has come up several times during the effort to add a scalar vtable for extension scalar values: https://github.com/vortex-data/vortex/pull/6477.

Once an `ExtDType` is type-erased to `ExtDTypeRef`, the only thing it carries is the dtype vtable (ONLY the dtype vtable, not the scalar or array vtables). Suppose you have an `ExtensionArray` and want to call scalar or array logic: you then need to look up the other vtables by `ExtID` in a session registry. This means threading `&VortexSession` through every code path that touches extension types (compute kernels, builders, canonicalization, display, etc.).

This is kind of torturous because the dtype vtable is literally _right there_ inside the `DType`, but the scalar/array vtables require a registry lookup to find. If the vtables were combined we would not have this issue. As things stand, sessions need to be plumbed through APIs that otherwise have no reason to take one (in other words, many constructors would need to take a session if they could _potentially_ create an extension type).
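To make the asymmetry concrete, here is a minimal sketch of the lookup problem. Every type below is a simplified, hypothetical stand-in (the real `ExtDTypeRef`, `VortexSession`, and vtable traits look different), scalars are reduced to a bare `i64`, and the `"vortex.timestamp"` ID is only illustrative.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-ins; none of these are the real Vortex definitions.
type ExtID = &'static str;

trait ExtDTypeVTable: Send + Sync {
    fn id(&self) -> ExtID;
}

trait ExtScalarVTable: Send + Sync {
    fn display(&self, storage: i64) -> String;
}

/// Type-erased dtype handle: it carries the dtype vtable and nothing else.
struct ExtDTypeRef(Arc<dyn ExtDTypeVTable>);

/// Session-scoped registry mapping extension IDs to scalar vtables.
#[derive(Default)]
struct VortexSession {
    scalar_vtables: HashMap<ExtID, Arc<dyn ExtScalarVTable>>,
}

struct Timestamp;

impl ExtDTypeVTable for Timestamp {
    fn id(&self) -> ExtID {
        "vortex.timestamp"
    }
}

impl ExtScalarVTable for Timestamp {
    fn display(&self, storage: i64) -> String {
        format!("{storage}s since epoch")
    }
}

/// Dtype-level logic needs no session: the vtable is inside the handle.
fn ext_id(dtype: &ExtDTypeRef) -> ExtID {
    dtype.0.id()
}

/// Scalar-level logic must take a session just to find the right vtable,
/// and the lookup can fail if the extension was never registered.
fn display_scalar(session: &VortexSession, dtype: &ExtDTypeRef, storage: i64) -> Option<String> {
    let vtable = session.scalar_vtables.get(dtype.0.id())?;
    Some(vtable.display(storage))
}
```

In this sketch `ext_id` works with just the handle, while `display_scalar` forces a `&VortexSession` parameter onto every caller and introduces a failure mode (the `None` case) that exists only because of the registry.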

What we probably want is a single `ExtVTable` per extension dtype that covers all three layers, so that when you have an `ExtDTypeRef` you already have everything you need.

## Crate Dependency Graph!

The crate dependency graph is:

```
vortex-array --(depends on)--> vortex-scalar --(depends on)--> vortex-dtype
```

A unified vtable trait would need to reference types from all three crates. That is impossible while the trait lives in `vortex-dtype`, which sits at the bottom of the graph and cannot depend on `vortex-scalar` or `vortex-array` without creating a dependency cycle.

---

# Potential Solutions

Here are some potential solutions, some uglier than others...

## Make the `VortexSession` a global static

This is not great for hygiene, but it _would_ mean that everything can access the session and look up vtables without having to pass `VortexSession` around everywhere.

Of course, whether this is worth adding a global execution context is up for debate.
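For reference, a global session could be sketched with `std::sync::OnceLock` roughly as follows. The `VortexSession` here is a hypothetical stand-in whose registry maps an extension ID to a display name instead of real vtables, and `install_session` / `lookup_ext_name` are made-up helper names.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical stand-in for the real session type.
#[derive(Default)]
struct VortexSession {
    // Extension ID -> display name, standing in for a vtable registry.
    ext_names: HashMap<&'static str, &'static str>,
}

// Process-wide session, set at most once.
static GLOBAL_SESSION: OnceLock<VortexSession> = OnceLock::new();

/// Install the process-wide session. Returns the rejected session as an
/// error if one was already installed.
fn install_session(session: VortexSession) -> Result<(), VortexSession> {
    GLOBAL_SESSION.set(session)
}

/// Any code path can reach the registry without a `&VortexSession` parameter.
/// Returns `None` if no session is installed or the ID is unknown.
fn lookup_ext_name(id: &str) -> Option<&'static str> {
    GLOBAL_SESSION.get()?.ext_names.get(id).copied()
}
```

The hygiene cost shows up directly in this sketch: the first caller to install a session wins for the lifetime of the process, so two sessions with different registries (for example in separate tests) cannot coexist.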

## Merge the Crates

In my opinion, this is a better solution than the above.

If `vortex-dtype`, `vortex-scalar`, and `vortex-array` were a single crate (or at least the extension vtable machinery lived in one place that could see all three), we could define:

```rust
pub trait ExtVTable: 'static + Send + Sync + ... {
    type Metadata: ...;

    // `DType`

    fn id(&self) -> ExtID;
    fn validate(&self, metadata: &Self::Metadata, storage: &DType) -> VortexResult<()>;
    fn serialize(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;
    fn deserialize(&self, data: &[u8]) -> VortexResult<Self::Metadata>;

    // `Scalar`

    // (This is not how it actually would look, but close enough)
    fn display(&self, metadata: &Self::Metadata, value: &ScalarValue, f: &mut fmt::Formatter) -> fmt::Result { ... }
    fn cast(&self, ...) -> VortexResult<Scalar> { ... }

    // `ArrayRef`

    fn cast_array(&self, ...) -> VortexResult<ArrayRef> { ... }
    // <-- Probably a lot more than this -->
}
```
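As a toy illustration of why the unified trait helps, the sketch below collapses the three layers into one hypothetical trait with storage reduced to `i64`. The names mirror the proposal, but the signatures are invented for the example and are not the proposed API.

```rust
use std::sync::Arc;

// Hypothetical, heavily simplified stand-in for a unified extension vtable.
trait ExtVTable: Send + Sync {
    // dtype layer
    fn id(&self) -> &'static str;

    // scalar layer
    fn display(&self, storage: i64) -> String;

    // array layer, with a default built on top of the scalar layer
    fn display_all(&self, storage: &[i64]) -> Vec<String> {
        storage.iter().map(|v| self.display(*v)).collect()
    }
}

/// Type-erased handle that now carries all three layers at once.
struct ExtDTypeRef(Arc<dyn ExtVTable>);

struct Timestamp;

impl ExtVTable for Timestamp {
    fn id(&self) -> &'static str {
        "vortex.timestamp"
    }

    fn display(&self, storage: i64) -> String {
        format!("{storage}s since epoch")
    }
}

// None of these functions needs a session or a registry lookup.
fn ext_id(dtype: &ExtDTypeRef) -> &'static str {
    dtype.0.id()
}

fn display_scalar(dtype: &ExtDTypeRef, storage: i64) -> String {
    dtype.0.display(storage)
}

fn display_array(dtype: &ExtDTypeRef, storage: &[i64]) -> Vec<String> {
    dtype.0.display_all(storage)
}
```

Because the handle owns the whole vtable, scalar and array logic become infallible method calls on the dtype, which is the property the merged-crate design is after.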

### `ScalarValue`s should store `ArrayRef` instead of `Vec<Option<ScalarValue>>`

Another thing that I have yet to mention is that we probably want to have `ScalarValue`s that can hold an `ArrayRef` directly. Right now, scalar lists are stored as `Vec<Option<ScalarValue>>`, which is **extremely** heavyweight. You can imagine for an extension type like a Tensor that scalars would instantly become a bottleneck for any compute operations like matrix multiplication.

This is impossible with the current crate structure as `vortex-array` depends on `vortex-scalar`, so we cannot store arrays inside scalars.

Arguably, the fact that we cannot do this is the _only_ reason that our scalars are not performant. The list variant is the only one that currently makes an owned allocation on creation (as opposed to shared allocations like `ByteBuffer`).
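The difference between an owned and a shared list representation can be sketched like this. Both enums are hypothetical simplifications: `OwnedValue` imitates the `Vec<Option<ScalarValue>>` layout, and `Arc<[i64]>` merely stands in for an `ArrayRef`.

```rust
use std::sync::Arc;

// Hypothetical: list elements are boxed up one by one in an owned Vec.
#[allow(dead_code)]
#[derive(Clone)]
enum OwnedValue {
    Int(i64),
    List(Vec<Option<OwnedValue>>),
}

// Hypothetical: the list is a reference-counted buffer.
#[allow(dead_code)]
#[derive(Clone)]
enum SharedValue {
    Int(i64),
    List(Arc<[i64]>),
}

fn owned_list(values: &[i64]) -> OwnedValue {
    // One enum value per element, all stored in a freshly allocated Vec.
    OwnedValue::List(values.iter().map(|v| Some(OwnedValue::Int(*v))).collect())
}

fn shared_list(values: &[i64]) -> SharedValue {
    // A single allocation that later clones share.
    SharedValue::List(Arc::from(values))
}

/// True if both owned lists point at the same backing allocation.
fn owned_shares_buffer(a: &OwnedValue, b: &OwnedValue) -> bool {
    match (a, b) {
        (OwnedValue::List(x), OwnedValue::List(y)) => std::ptr::eq(x.as_ptr(), y.as_ptr()),
        _ => false,
    }
}

/// True if both shared lists point at the same backing allocation.
fn shared_shares_buffer(a: &SharedValue, b: &SharedValue) -> bool {
    match (a, b) {
        (SharedValue::List(x), SharedValue::List(y)) => Arc::ptr_eq(x, y),
        _ => false,
    }
}
```

Cloning an `OwnedValue::List` copies every element into a new `Vec`, whereas cloning a `SharedValue::List` only bumps a reference count, so `shared_shares_buffer(&s, &s.clone())` holds while the owned equivalent does not.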

# Open Questions

- Are there other approaches that avoid merging the crates or having global static variables? We haven't been able to think of any.
- Is the crate split between dtype/scalar/array load-bearing for compile times or other reasons (_I strongly doubt this_)?
- Are there extension-array operations that _shouldn't_ be bundled into the vtable?
- Is this overkill or underkill?
- It might be a good idea to figure out what _exactly_ we want from extension types. The extension types we know we want are tensors and UUID, but we should work out what kinds of APIs they need and what a clean interface with Vortex might look like.

CC @gatesn </div>

Tracking Issue: Extension DTypes, Scalars, and Arrays #6547
