Extend the CSVLoader class to read from different datasources/targets and different kinds of formats #70
Comments
The With respect to transparently reading compressed formats, we could add that as an option to the previously mentioned data sources. For It's intentional that the current loaders do not read from remote endpoints (apart from |
I'm finishing off a tutorial on |
@Craigacp I agree with the points above; maybe I wasn't clear enough. I meant inheriting from The parent class is an implementation detail; providing more ways to load data and capture the source is what I was alluding to. |
About compressed files: directory/folder support won't be necessary; just allowing it to detect that the input is a compressed CSV/JSON file is more than enough. There may be use cases for this, as I have already seen that when data files get used, lots of different lightweight formats are sought after to solve read/write issues (storage and latency). |
We already have something that can transparently figure out if it's a GZipped file elsewhere in OLCUT, which will return the appropriate input stream implementation. We could probably extend that to support zip, but I don't think we'd want to induce a dependency inside OLCUT to get bzip support. |
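To make the "transparently figure out if it's a GZipped file" idea concrete, here is a minimal sketch using only the JDK. It is not the OLCUT implementation: the class name `CompressionSniffer` and the method `maybeDecompress` are hypothetical names chosen for illustration. It peeks at the first two bytes of the stream and wraps it in a `GZIPInputStream` only when they match the gzip magic number.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for illustration; not part of OLCUT or Tribuo. */
final class CompressionSniffer {
    private CompressionSniffer() {}

    /**
     * Returns a stream of decompressed bytes if the input starts with the
     * gzip magic bytes (0x1f 0x8b), otherwise returns the input unchanged
     * (apart from buffering).
     */
    static InputStream maybeDecompress(InputStream raw) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(raw);
        // Remember the current position so the two peeked bytes can be re-read.
        buffered.mark(2);
        int first = buffered.read();
        int second = buffered.read();
        buffered.reset();
        if (first == 0x1f && second == 0x8b) {
            return new GZIPInputStream(buffered);
        }
        return buffered;
    }
}
```

A caller would pass the returned stream to whatever CSV or JSON parser it already uses. Zip support could follow the same pattern by checking for the zip signature and using `java.util.zip.ZipInputStream`, which is also in the JDK.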
So concretely there would be:
For the last point I'm not clear what's required. Tribuo can already connect to things via JDBC, and read delimited and json format inputs. Are there other major formats we should support? We use |
Is your feature request related to a problem? Please describe.
At the moment it appears that CSVLoader can only load .csv files from the disk or a file system. This could be a limitation from both the functionality point of view and the provenance (metadata) recording point of view.
In the provenance data, we see the path of the file given during the training process; this path could be invalid if the process was run in Docker containers or by other ephemeral means.
Other libraries (non-Java based) allow loading .tgz, .zip, etc. formats, and although this may just be a single step, when trying to manage multiple datasets this can be a boon.
Describe the solution you'd like
CSVLoader, through sub-class implementations, to allow loading:
.tgz, .zip (compressed formats mainly)
Additional context
Maybe show these or other features of CSVLoader via notebook tutorials.
This request is actually twofold:
Once any or all of these are established, the provenance information can carry a more independent set of information on how to replicate the data loading process.
For example:
From the above I could not easily recreate the model building process, or even just the data loading process, because
path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data
is local to an individual computer system. We could instead have paths like
path = https://path/to/bezdekIris.data
which would make the whole process a lot more independent. It would also add value to the provenance metadata, as we would know the original source of the data.
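As a rough illustration of the remote-path idea, the sketch below reads delimited rows through `java.net.URL`, so the same code handles `file:` and `https:` locations and the URL itself can be recorded as the source. This is not Tribuo's CSVLoader: the class `UrlCsvReader` and the method `readRows` are hypothetical, and the comma splitting is deliberately naive (no quoting or escaping support).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for illustration; not Tribuo's CSVLoader. */
final class UrlCsvReader {
    private UrlCsvReader() {}

    /**
     * Reads every non-empty line from the given URL and splits it on commas.
     * Works for any scheme the JDK can open, e.g. file: or https:.
     */
    static List<String[]> readRows(URL source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    // Naive split: assumes no quoted fields containing commas.
                    rows.add(line.split(",", -1));
                }
            }
        }
        return rows;
    }
}
```

A local file can be passed as `path.toUri().toURL()`, so existing file-based usage would not need a separate code path. Whether remote loading is desirable at all is a separate design question for the maintainers.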