
Convenient way to download all data connected to a flow #205

Open
JaGeo opened this issue Nov 5, 2024 · 3 comments

Comments

@JaGeo
Collaborator

JaGeo commented Nov 5, 2024

Hi all,

I was wondering if there is currently a convenient way to download all data connected to a flow. While I can retrieve data for a certain job, I haven't found such an option for a whole flow. I would assume that this might be something other users would be interested in as well.

@gpetretto
Contributor

Hi @JaGeo,

when you mention the data connected to a Flow, are you referring to the Flow structure (e.g. the list of all the Jobs' information plus the connections between the Jobs), to the Job outputs, or to both?
Which of the functionality that currently exists for Jobs would you like to have available for a whole Flow?

@JaGeo
Collaborator Author

JaGeo commented Nov 6, 2024

@gpetretto I would like to be able to download all raw data, but ideally such a download would include the information about the flow and its jobs as well. I did not think about this last part at first, but it would make reconstructing the data much easier.

It would be nice to have some kind of archive option: download all raw data into one folder, get all outputs from the database, and add all job connections. Or do you have a better solution in mind for moving data to long-term storage?

@gpetretto
Contributor

I see, sorry, I had mistakenly assumed that you were referring to the content of the DB and not the raw data. So basically an equivalent of jf job get, but for flows.
An example could be a command like jf flow files get that, for a specific flow (or maybe many flows, selected with a query), downloads the data from the worker and puts them in an organized folder. E.g. the uuid of the flow as the main folder; then it can also use the uuids of the hosts to group jobs according to the subflows, with each job in a folder named "{job_name}{uuid}{index}". And maybe a dump of the connections in the main flow folder?

I would tend to keep the backup of the raw data separate from that of the output Store, because if the output documents are split and dumped in the corresponding job folders, it may be difficult to reconstruct the output Store afterwards, if needed. Since there are many kinds of Stores, I am inclined to think that it would be better to rely on specific tools for their backup. What do you think?
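
To make the proposed folder layout concrete, here is a minimal sketch of how such an archive tree could be built. It does not use any jobflow-remote API: `JobRecord`, `job_folder`, `build_archive_layout` and the `connections.json` file name are all hypothetical, the underscore separators in the job folder name are an assumption, and the host uuids are assumed to be listed outermost first. Downloading the actual files from the worker is left out.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class JobRecord:
    """Hypothetical minimal job description (not a jobflow-remote class)."""

    name: str
    uuid: str
    index: int
    # uuids of the flows containing the job, assumed outermost first
    hosts: list[str] = field(default_factory=list)
    # uuids of the jobs this job depends on
    parents: list[str] = field(default_factory=list)


def job_folder(root: Path, flow_uuid: str, job: JobRecord) -> Path:
    """Return <root>/<flow uuid>/<subflow uuids...>/<name>_<uuid>_<index>."""
    # Drop the main flow uuid from the hosts so it is not repeated in the path.
    subflows = [h for h in job.hosts if h != flow_uuid]
    return root.joinpath(flow_uuid, *subflows, f"{job.name}_{job.uuid}_{job.index}")


def build_archive_layout(
    root: Path, flow_uuid: str, jobs: list[JobRecord]
) -> dict[tuple[str, int], Path]:
    """Create one folder per job and dump the job connections in the flow folder."""
    flow_dir = root / flow_uuid
    flow_dir.mkdir(parents=True, exist_ok=True)

    folders: dict[tuple[str, int], Path] = {}
    for job in jobs:
        folder = job_folder(root, flow_uuid, job)
        folder.mkdir(parents=True, exist_ok=True)
        folders[(job.uuid, job.index)] = folder

    # Assumed format for the dump: job uuid -> list of parent job uuids.
    connections = {job.uuid: job.parents for job in jobs}
    (flow_dir / "connections.json").write_text(json.dumps(connections, indent=2))
    return folders
```

A real implementation would also need to sanitize job names that contain characters not allowed in paths, and would copy each job's files from the worker into the folder returned for it.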
