Connect data filtering, visualization, PII pipeline into datastore #262

Open
@huu4ontocord

Datastore is an extension of datasets that interoperates with SQLite and memmap. I created it because I wanted a low-resource way to do data processing on datasets. Arrow can be cumbersome for operations such as updates, and with SQLite we get the power of SQL and full-text search. Note that the full-text search is not a service that requires a server; it is built on SQLite itself and is not as flexible as the indexsearch @ggdupont is working on.
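To illustrate the serverless point, here is a minimal sketch using only Python's built-in `sqlite3` module, not datastore's own API. It assumes the local SQLite build includes the FTS5 extension; the table name `docs` and the sample rows are made up for illustration.

```python
import sqlite3

# An in-memory database; a file path works the same way. No server is involved.
conn = sqlite3.connect(":memory:")

# FTS5 virtual table (assumption: this SQLite build was compiled with FTS5).
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(text)")
conn.executemany(
    "INSERT INTO docs(rowid, text) VALUES (?, ?)",
    [
        (1, "the quick brown fox"),
        (2, "a lazy dog sleeps"),
        (3, "quick thinking helps"),
    ],
)

# Updating a single row in place is one SQL statement.
conn.execute("UPDATE docs SET text = ? WHERE rowid = ?", ("a lazy cat sleeps", 2))


def search(query):
    """Return the rowids of documents matching a full-text query."""
    rows = conn.execute(
        "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in rows]


quick_hits = search("quick")  # rows containing the token "quick"
dog_hits = search("dog")      # empty after the update above
cat_hits = search("cat")      # the updated row
```

The index is updated together with the row, so the search results reflect the edit without any separate reindexing step.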

However, it would be cool to connect the ac_dc filtering and PII pipeline to datastore so we can do things like:

  • load dataset X
  • ac/dc filter
  • PII process
  • full text index in sqlite
  • run through distiluse and store the resulting vectors in a memmap column
  • and visualize a subset based on perplexity param, registry param and full text search
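The steps above could be wired together roughly as follows. This is only a sketch under stated assumptions, using the standard library rather than datastore's actual interface: `quality_filter`, `scrub_pii`, `toy_embed` and `fake_perplexity` are hypothetical stand-ins for the ac_dc filter, the PII pipeline, distiluse and a real perplexity score, and the vectors go into a plain memory-mapped file instead of a datastore memmap column. It also assumes the SQLite build includes FTS5.

```python
import mmap
import os
import re
import sqlite3
import struct
import tempfile

DIM = 4  # toy embedding size; a real sentence encoder produces far more


def quality_filter(text):
    # Hypothetical stand-in for the ac_dc filter: keep texts with 3+ words.
    return len(text.split()) >= 3


def scrub_pii(text):
    # Hypothetical stand-in for the PII pipeline: mask email addresses only.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL>", text)


def toy_embed(text):
    # Placeholder for distiluse: a deterministic vector from character codes.
    vec = [0.0] * DIM
    for i, ch in enumerate(text):
        vec[i % DIM] += ord(ch) / 1000.0
    return vec


def fake_perplexity(text):
    # Placeholder score (average word length), not a language-model perplexity.
    words = text.split()
    return sum(len(w) for w in words) / len(words)


def run_pipeline(texts, workdir):
    conn = sqlite3.connect(os.path.join(workdir, "store.db"))
    conn.execute("CREATE TABLE meta (id INTEGER PRIMARY KEY, perplexity REAL)")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(text)")
    kept = [scrub_pii(t) for t in texts if quality_filter(t)]
    vec_path = os.path.join(workdir, "vectors.bin")
    row_bytes = DIM * 4  # DIM float32 values per row
    with open(vec_path, "wb") as f:
        f.truncate(len(kept) * row_bytes)  # preallocate the vector file
    if not kept:  # an empty file cannot be memory-mapped
        return conn, vec_path
    with open(vec_path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        for i, text in enumerate(kept):
            conn.execute("INSERT INTO docs(rowid, text) VALUES (?, ?)", (i + 1, text))
            conn.execute(
                "INSERT INTO meta(id, perplexity) VALUES (?, ?)",
                (i + 1, fake_perplexity(text)),
            )
            # Row i of the vector file belongs to rowid i + 1 in SQLite.
            struct.pack_into(f"{DIM}f", mm, i * row_bytes, *toy_embed(text))
        mm.flush()
    conn.commit()
    return conn, vec_path


def select_subset(conn, query, max_perplexity):
    # Combine full-text search with a numeric filter to pick rows to visualize.
    rows = conn.execute(
        "SELECT docs.rowid, docs.text FROM docs "
        "JOIN meta ON meta.id = docs.rowid "
        "WHERE docs MATCH ? AND meta.perplexity <= ? ORDER BY docs.rowid",
        (query, max_perplexity),
    )
    return rows.fetchall()


def read_vector(vec_path, index):
    # Map the file read-only and decode a single row without loading the rest.
    with open(vec_path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return list(struct.unpack_from(f"{DIM}f", mm, index * DIM * 4))


texts = [
    "Contact me at jane@example.com about the corpus",
    "too short",
    "The corpus contains many multilingual documents",
]
workdir = tempfile.mkdtemp()
conn, vec_path = run_pipeline(texts, workdir)
subset = select_subset(conn, "corpus", max_perplexity=5.0)
first_vec = read_vector(vec_path, 0)
```

Everything here runs in a single process against two local files, which is the low-resource property the issue is after; the selected rows could then be handed to whatever visualization front end is chosen.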

This is not an immediate need, but it would provide a low-compute, low-resource tool (no servers needed) that promotes equal access to language tech for researchers around the world.
