Description
Datastore is an extension of datasets that interoperates with sqlite and memmap. I created it because I wanted a low-resource way to do data processing on datasets. Arrow can be cumbersome for operations such as updates, and with sqlite we get the power of SQL and full text search. Note that the full text search is not a service that requires a server; it is built on sqlite itself and is not as flexible as the indexsearch @ggdupont is working on.
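To illustrate the serverless update and full text search point, here is a minimal sketch using Python's built-in `sqlite3` module and sqlite's FTS5 extension. It is not the datastore API: the `docs` table, the `search` helper, and the sample rows are made up for illustration, and it assumes the local sqlite build was compiled with FTS5.

```python
import sqlite3

# In-memory database: no server process, just the sqlite library.
conn = sqlite3.connect(":memory:")

# FTS5 virtual table (assumes sqlite was compiled with FTS5 support).
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(text)")
rows = [
    ("The quick brown fox jumps over the lazy dog",),
    ("Datasets can be indexed without a server",),
    ("SQL updates are simple in sqlite",),
]
conn.executemany("INSERT INTO docs(text) VALUES (?)", rows)

# In-place update of a single row, the kind of edit that is awkward in Arrow.
conn.execute(
    "UPDATE docs SET text = ? WHERE rowid = ?",
    ("SQL updates are easy in sqlite", 3),
)


def search(query):
    """Return (rowid, text) pairs whose text matches the FTS5 query."""
    cur = conn.execute(
        "SELECT rowid, text FROM docs WHERE docs MATCH ? ORDER BY rowid",
        (query,),
    )
    return cur.fetchall()


hits = search("server")
```

The index lives in the same database file as the rest of the data, so searching needs nothing beyond the sqlite library.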
However, it would be cool to connect the ac_dc filtering and PII pipeline to datastore so we can do things like:
- load dataset X
- ac/dc filter
- PII process
- full text index in sqlite
- run through distiluse and store the resulting vectors in a memmap column
- and visualize a subset based on perplexity param, registry param and full text search
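A rough sketch of how part of this flow could fit together, assuming `numpy` is available. None of this is the datastore API: `fake_embed` is a hypothetical placeholder for a sentence encoder such as distiluse, the perplexity values and thresholds are invented, the filter only stands in for ac/dc, and the PII and visualization steps are left out.

```python
import os
import sqlite3
import tempfile

import numpy as np

DIM = 8  # hypothetical embedding size, kept tiny for the example


def fake_embed(text):
    """Hypothetical placeholder for an encoder such as distiluse."""
    rng = np.random.default_rng(len(text))  # deterministic per text length
    return rng.random(DIM, dtype=np.float32)


# Toy records standing in for "dataset X"; perplexity values are made up.
records = [
    {"text": "a clean sentence about language models", "perplexity": 35.0},
    {"text": "zzzz qqqq noisy text", "perplexity": 900.0},
    {"text": "another clean sentence about datasets", "perplexity": 50.0},
]

# Stand-in for the ac/dc filter: keep only low-perplexity rows.
kept = [r for r in records if r["perplexity"] < 100.0]

# Metadata goes into sqlite so it can be queried and updated with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meta (id INTEGER PRIMARY KEY, text TEXT, perplexity REAL)"
)
conn.executemany(
    "INSERT INTO meta (id, text, perplexity) VALUES (?, ?, ?)",
    [(i, r["text"], r["perplexity"]) for i, r in enumerate(kept)],
)

# Vectors go into a memory-mapped file; row i belongs to meta.id == i.
path = os.path.join(tempfile.mkdtemp(), "vectors.mmap")
vectors = np.memmap(path, dtype=np.float32, mode="w+", shape=(len(kept), DIM))
for i, r in enumerate(kept):
    vectors[i] = fake_embed(r["text"])
vectors.flush()

# Select a subset by a perplexity parameter, then read only those vectors.
ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM meta WHERE perplexity < ? ORDER BY id", (40.0,)
    )
]
subset = np.asarray(vectors[ids])
```

Keeping the row id as the shared key means the SQL side decides which rows are wanted and the memmap side only touches those rows on disk.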
This is not an immediate need, but it would provide a low-compute, low-resource (no servers needed) tool that would promote equal access to language tech for researchers around the world.