Apache Arrow provides a common specification for exchanging data in memory. Because Yosegi also exchanges data with other OSS, it supports Apache Arrow.
Yosegi provides I/O functions for Apache Arrow. For reading, it provides a function to read a Yosegi file into the Apache Arrow format. For writing, it provides a function to write data in the Apache Arrow format to a Yosegi file. Because Yosegi is implemented in Java, Yosegi files cannot be read or written directly from other languages. To support those languages, Yosegi provides a command that creates a file in the Apache Arrow format.
Currently, creating Yosegi files directly from other programming languages is not supported.
Yosegi provides a function to read a Yosegi file as an array of ValueVector.
Apache Arrow uses BufferAllocator to manage memory. When a column is created, its memory is allocated from the BufferAllocator. The unit of reading in Yosegi is the Spread. When reading a Yosegi file, a BufferAllocator is created together with the Reader and is reset each time a Spread is read.
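The allocator lifecycle described above can be illustrated with a minimal sketch. It does not use the Apache Arrow or Yosegi APIs: `TrackingAllocator` and `SpreadReader` are hypothetical stand-ins, and the "reset" here simply clears a byte counter, which is an assumption about the intent rather than a description of the actual implementation.

```java
// Hypothetical sketch: none of these classes are Apache Arrow or Yosegi classes.
class AllocatorLifecycleSketch {

  // Stand-in for an allocator that tracks how many bytes are in use.
  static final class TrackingAllocator {
    private long used;

    long[] allocateLongs(int count) {
      used += (long) count * Long.BYTES;
      return new long[count];
    }

    // Forget all outstanding allocations (assumed meaning of "reset").
    void reset() {
      used = 0;
    }

    long allocatedBytes() {
      return used;
    }
  }

  // Stand-in reader: one allocator per reader, reset once per Spread.
  static final class SpreadReader {
    private final TrackingAllocator allocator = new TrackingAllocator();
    private final int[] spreadSizes;
    private int position;

    SpreadReader(int... spreadSizes) {
      this.spreadSizes = spreadSizes;
    }

    boolean hasNext() {
      return position < spreadSizes.length;
    }

    // Reads the next Spread as a single long column.
    long[] nextSpread() {
      allocator.reset(); // memory of the previous Spread is no longer counted
      return allocator.allocateLongs(spreadSizes[position++]);
    }

    long allocatedBytes() {
      return allocator.allocatedBytes();
    }
  }

  public static void main(String[] args) {
    SpreadReader reader = new SpreadReader(4, 2);
    while (reader.hasNext()) {
      long[] column = reader.nextSpread();
      System.out.println(column.length + " values, "
          + reader.allocatedBytes() + " bytes in use");
    }
  }
}
```

Because the allocator is reset before each Spread, the memory in use reflects only the current Spread instead of growing with the number of Spreads read.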
Because Yosegi has no schema, the schema may differ from one Spread to the next. Apache Arrow likewise does not need a schema when creating data structures. However, the schema to read is often decided by the query engine or a similar component, and an error can occur if the data structure does not match that schema. For this reason, the conversion from Yosegi to Apache Arrow supports both reading with a predetermined schema and reading without a schema.
A column is represented as a ValueVector, and a ValueVector implementation exists for each data type. When a schema is specified, the ValueVector is created with the corresponding type. When no schema is specified, it is created with the same data type as the Yosegi column. When data is set, each value is converted to the object that the ValueVector expects before it is stored.
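The two reading modes can be sketched as follows. This is an illustration only, not Yosegi or Apache Arrow code: `ColumnType`, `SourceColumn`, `read`, and `convert` are hypothetical names, and passing `null` as the requested type stands for "no schema specified".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these types are not part of Yosegi or Apache Arrow.
class SchemaConversionSketch {

  enum ColumnType { LONG, DOUBLE, STRING }

  // Stand-in for a stored column that knows its own data type.
  static final class SourceColumn {
    final ColumnType type;
    final List<Object> values;

    SourceColumn(ColumnType type, List<Object> values) {
      this.type = type;
      this.values = values;
    }
  }

  // requested == null means no schema was given: keep the column's own type.
  static List<Object> read(SourceColumn column, ColumnType requested) {
    ColumnType target = requested == null ? column.type : requested;
    List<Object> out = new ArrayList<>();
    for (Object value : column.values) {
      out.add(convert(value, target));
    }
    return out;
  }

  // Converts one value to the object expected by the target type.
  // Unparsable strings throw NumberFormatException in this sketch.
  static Object convert(Object value, ColumnType target) {
    if (value == null) {
      return null;
    }
    switch (target) {
      case LONG:
        return value instanceof Number
            ? ((Number) value).longValue()
            : Long.parseLong(value.toString());
      case DOUBLE:
        return value instanceof Number
            ? ((Number) value).doubleValue()
            : Double.parseDouble(value.toString());
      default:
        return value.toString();
    }
  }

  public static void main(String[] args) {
    SourceColumn column =
        new SourceColumn(ColumnType.LONG, Arrays.asList((Object) 1L, 2L, null));
    System.out.println(read(column, null));              // column's own type
    System.out.println(read(column, ColumnType.STRING)); // schema asks for strings
  }
}
```

Deciding the target type once per column, and converting each value as it is set, keeps the resulting vector consistent with the schema the caller expects.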
Yosegi does not require schema information when writing. Data is written with its data structure as is.
Because a ValueVector already holds the same information as a column, its data is not copied into a Column. Instead, Yosegi provides a class that implements the IColumn interface and wraps the ValueVector. This reduces the processing cost of parsing the data, so writing is faster.
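The wrapping approach can be sketched with a small, self-contained example. The names below are hypothetical: `LongVectorLike` stands in for an Arrow vector, and `ColumnView` is a reduced illustration of a column interface, not Yosegi's actual IColumn.

```java
// Hypothetical sketch: these types are not part of Yosegi or Apache Arrow.
class ColumnWrapperSketch {

  // Stand-in for a vector: a typed buffer plus a value count.
  static final class LongVectorLike {
    final long[] buffer;
    final int valueCount;

    LongVectorLike(long[] buffer, int valueCount) {
      this.buffer = buffer; // keeps the reference, no copy
      this.valueCount = valueCount;
    }
  }

  // Reduced column interface used only for this illustration.
  interface ColumnView {
    String name();
    int size();
    long getLong(int index);
  }

  // Adapter: exposes the vector through the column interface without copying.
  static final class WrappedColumn implements ColumnView {
    private final String name;
    private final LongVectorLike vector;

    WrappedColumn(String name, LongVectorLike vector) {
      this.name = name;
      this.vector = vector;
    }

    @Override
    public String name() {
      return name;
    }

    @Override
    public int size() {
      return vector.valueCount;
    }

    @Override
    public long getLong(int index) {
      if (index < 0 || index >= vector.valueCount) {
        throw new IndexOutOfBoundsException("index: " + index);
      }
      return vector.buffer[index]; // read straight from the vector's buffer
    }
  }

  public static void main(String[] args) {
    long[] buffer = {10L, 20L, 30L};
    ColumnView column = new WrappedColumn("id", new LongVectorLike(buffer, 3));
    for (int i = 0; i < column.size(); i++) {
      System.out.println(column.name() + "[" + i + "] = " + column.getLong(i));
    }
  }
}
```

Since the wrapper only holds a reference, a writer that consumes the column interface reads values directly from the vector's memory, with no intermediate copy or parsing step.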