Ingestion of common/shared data #61

Open
sonjastndl opened this issue Jun 5, 2024 · 19 comments

@sonjastndl

Hi everyone,

I am curious about the modalities of the shared s3 bucket on EOX Hub.

If I access data at the EOX Hub (example: requesting a .nc file on Surface Map for the year 2019 via the CDS API), I currently store that data in our UC-specific bucket. I can imagine the same happening for other Use Cases and datasets, and I wonder whether we are accumulating a lot of redundant data in the UC-specific buckets.

Are there any recommendations from the EOX side on how to handle this, @Schpidi @eox-cs1? Simply copying files to the common s3 bucket might cause some chaos because of naming and other issues, and it would not by itself avoid redundancy.
I assume we will talk about this in next week's meeting. Still, it would be good to know whether there are already some foreseen "best practices" and/or restrictions that need to be considered.

@KathiSchleidt @Susannaioni

@KathiSchleidt
Member

@eox-cs1 @Schpidi we really need feedback on how to structure the S3 buckets. How can UC partners:

  • make commonly accessible data available to all UC
  • keep sensitive data constrained to their UC
  • access both common and sensitive data via which server?

Related to the issue on server configuration, #64

@eox-cs1

eox-cs1 commented Jun 17, 2024

* make commonly accessible data available to all UC

all UCs have access to a folder located at /shared/fairicube or /home//.shared/fairicube
Inside this folder (and its subfolders), all FAIRiCUBE UCs can store data that is then accessible to all UCs.

* keep sensitive data constrained to their UC

Every UC has its own associated s3 bucket, which is accessible only by that UC.

* access both common and sensitive data via which server?

all UCs have access to a folder located at /shared/fairicube or /home//.shared/fairicube

The respective kernel (selected at the beginning of a session) determines which software tools are available (as requested by each UC), but has no influence on the availability of the shared/private folders (see above).
(see also #64)

@KathiSchleidt
Member

@eox-cs1 where is this documented?

@BachirNILU

Hi @eox-cs1,

Thank you for your responses!
I have a very specific question.
I have uploaded data to the s3 bucket under the UC4 server option.
So I understand this bucket is specific to UC4 users.
The resources (RAM) were not sufficient to run my code, so I plan to use another server option (UC1) that has 60 GB of RAM, which would be sufficient.
To avoid duplicating the data, I want to run the code after connecting to the UC1 server option and access the data in UC4's s3 bucket. How can I do that?

Thanks in advance,

Best regards,

-Bachir.

@eox-cs1

eox-cs1 commented Jun 17, 2024

Direct cross-UC access is not foreseen, as it would somewhat contradict the separation.

However, you can either use the common /shared/fairicube folder for data exchange/access
OR
you can take the respective secrets from the desired storage and manually inject them into your environment (in addition to the use-case specific ones). This should also give you access to the UC4 s3 bucket from UC1
OR
you can ask for a larger machine for UC4 (if you will need this more frequently).
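
The second option (injecting another bucket's secrets by hand) could look roughly like the sketch below. It is an illustration, not a documented EOX procedure: the variable names `UC4_S3_KEY`, `UC4_S3_SECRET`, `UC4_S3_BUCKET` and `UC4_S3_ENDPOINT` are hypothetical placeholders that you would set yourself in the UC1 session, and `boto3` is assumed to be installed in the kernel.

```python
import os


def s3_client_kwargs(prefix, env=None):
    """Collect boto3 client arguments from variables such as <prefix>_S3_KEY.

    The variable names are hypothetical; use whatever names you chose when
    copying the secrets into the session.
    """
    env = os.environ if env is None else env
    kwargs = {
        "aws_access_key_id": env[f"{prefix}_S3_KEY"],
        "aws_secret_access_key": env[f"{prefix}_S3_SECRET"],
    }
    # Assumption: the storage may need a custom S3 endpoint; pass it only
    # when one was provided.
    endpoint = env.get(f"{prefix}_S3_ENDPOINT")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


def list_bucket(prefix, env=None, max_keys=10):
    """List a few object keys in the bucket named by <prefix>_S3_BUCKET."""
    import boto3  # imported lazily so the helper above works without boto3

    env = os.environ if env is None else env
    client = boto3.client("s3", **s3_client_kwargs(prefix, env))
    response = client.list_objects_v2(
        Bucket=env[f"{prefix}_S3_BUCKET"], MaxKeys=max_keys
    )
    return [obj["Key"] for obj in response.get("Contents", [])]
```

Calling `list_bucket("UC4")` from a UC1 session would then read the UC4 bucket, while the session's own use-case credentials stay untouched.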

@sonjastndl
Author

sonjastndl commented Jun 17, 2024

@mari-s4e
Hey Maria,
this somewhat contradicts where we have been exchanging data so far, right?
So to access the shared folder, none of the actions documented in the FAIRiCUBE Notebook are necessary?

Because data there is not located under shared...
This is btw also the issue here.
Can someone explain the difference?

@BachirNILU

Thanks @eox-cs1!
I think a direct cross-UC access makes sense with respect to users (limited access).
For instance, I am involved in both UC1 and UC4. A cross-UC access to "only" UC1 and UC4 would make sense.
This can be adapted to each user.
For the suggested options, how can I use option 2 (access UC4 bucket from UC1)? What commands should I run?

Thanks in advance.

@Schpidi
Member

Schpidi commented Jun 18, 2024

@BachirNILU I believe in your case the simplest solution would be to add another profile option to UC4. Note that the profile itself does not incur costs; costs arise only when you run a session.

If you want to use data from a use-case specific bucket anywhere else, you can retrieve the required details (access keys, etc.) from the environment variables, for example with a command like printenv|grep S3_USER.
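
The same lookup can be done from a notebook cell. The sketch below is a rough Python counterpart of `printenv|grep S3_USER`. It assumes nothing about the exact variable names beyond the `S3_USER` substring mentioned above. Two differences from the shell command are deliberate additions: it matches on variable names only, and it masks the values so that secrets do not end up in saved notebook output.

```python
import os


def find_env(substring, env=None):
    """Return the environment variables whose name contains `substring`."""
    env = os.environ if env is None else env
    return {name: value for name, value in env.items() if substring in name}


def mask(value, visible=4):
    """Hide all but the first `visible` characters of a secret."""
    return value[:visible] + "*" * max(len(value) - visible, 0)


if __name__ == "__main__":
    for name, value in sorted(find_env("S3_USER").items()):
        print(f"{name}={mask(value)}")
```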

@BachirNILU

Thank you @Schpidi for your response!
This is great! If the additional UC4 profile option does not entail additional costs (unless used, of course), that works perfectly for us, given that we do not need that much memory frequently. 60 GB of RAM will be great.
Thanks again!

@Schpidi
Member

Schpidi commented Jun 18, 2024

@sonjastndl sorry, there is a little misunderstanding that I might have caused in the last call.

We offer two types of storage, "File Storage" and "Object Storage", which have slightly different capabilities.

Per default in a JupyterLab session you get your personal workspace as well as a shared folder (/shared/fairicube/ or ~/.shared/fairicube/) which are both persisted on File Storage.

In addition we provide Object Storage to each Use Case separately for example to use with Sentinel Hub. This Object Storage is for convenience mounted to ~/s3 for each user but preferably used via the s3 protocol.

On top of this we were asked to provide shared Object Storage accessible to all Use Cases. This is the fairicube bucket where access keys are shared for usage. This Object Storage is not automatically mounted in the JupyterLab session.
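
To see which of these locations are present in a given session, a small check such as the following can help. It is only a sketch: the paths are the ones named in this comment, while the function names and labels are made up for illustration. The shared `fairicube` bucket is left out because, as noted above, it is not mounted automatically.

```python
from pathlib import Path


def storage_locations(home=None):
    """Map the mounted storage locations named in this thread to their paths."""
    home = Path.home() if home is None else Path(home)
    return {
        "shared folder (File Storage)": Path("/shared/fairicube"),
        "shared folder via home (File Storage)": home / ".shared" / "fairicube",
        "UC bucket mount (Object Storage)": home / "s3",
    }


def report(home=None):
    """Return one line per location stating whether the path exists."""
    return [
        f"{label}: {path} ({'found' if path.exists() else 'missing'})"
        for label, path in storage_locations(home).items()
    ]


if __name__ == "__main__":
    print("\n".join(report()))
```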

I hope this clarifies your questions.

@Schpidi
Member

Schpidi commented Jun 18, 2024

@BachirNILU we'll roll out the required configuration tomorrow and enable a large profile for UC4.

@Schpidi
Member

Schpidi commented Jun 19, 2024

@BachirNILU the big UC4 profile (Server Option) is now available

@Schpidi
Member

Schpidi commented Jun 19, 2024

With this I believe we can come back to the original question 😉

From a technical point of view the questions to answer are:

  • Do I need the data only locally, i.e., in JupyterLab? --> Use your local workspace
  • Do I want to share the data with all users of FAIRiCUBE locally? --> Use the shared folder
  • Do I want to share the data with all users in my Use Case or do I need external access like via Sentinel Hub services? --> Use the UC bucket
  • Do I want to share the data with all users of FAIRiCUBE and need external access? --> Use the shared bucket
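
The four questions above can also be written down as a small decision helper. This is only a restatement of the list in code; the function name and the argument values (`"me"`, `"use_case"`, `"all"`) are invented for the example.

```python
def choose_storage(share_with, external_access=False):
    """Pick a storage option following the four questions above.

    share_with: "me" (only I need the data), "use_case" (my UC team)
                or "all" (all FAIRiCUBE users).
    external_access: True if the data must be reachable from outside
                     JupyterLab, e.g. by Sentinel Hub services.
    """
    if share_with not in {"me", "use_case", "all"}:
        raise ValueError(f"unknown audience: {share_with!r}")
    if share_with == "all":
        return "shared bucket" if external_access else "shared folder"
    if share_with == "use_case" or external_access:
        return "UC bucket"
    return "local workspace"
```

For example, `choose_storage("all", external_access=True)` returns `"shared bucket"`, and `choose_storage("me")` returns `"local workspace"`.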

How to organize data on the bucket doesn't matter from a technical point of view but, I agree, it should be agreed on within a UC team or among all FAIRiCUBE users, and documented.

@KathiSchleidt
Member

@Schpidi many thanks for the clarification, but I fear now I understand exactly nothing :(

I read about "File Storage" and "Object Storage", which "have slightly different capabilities", but I see no indication of what these slight differences are.

I then find either 3 or 4 options of where to put data:

  1. Per default in a JupyterLab session you get your personal workspace as well as a shared folder (/shared/fairicube/ or ~/.shared/fairicube/) which are both persisted on File Storage.
  2. In addition we provide Object Storage to each Use Case separately for example to use with Sentinel Hub. This Object Storage is for convenience mounted to ~/s3 for each user but preferably used via the s3 protocol.
  3. On top of this we were asked to provide shared Object Storage accessible to all Use Cases. This is the fairicube bucket where access keys are shared for usage. This Object Storage is not automatically mounted in the JupyterLab session.

vs.

  1. Do I need the data only locally, i.e., in JupyterLab? --> Use your local workspace
  2. Do I want to share the data with all users of FAIRiCUBE locally? --> Use the shared folder
  3. Do I want to share the data with all users in my Use Case or do I need external access like via Sentinel Hub services? --> Use the UC bucket
  4. Do I want to share the data with all users of FAIRiCUBE and need external access? --> Use the shared bucket

For the common bucket, I've found some documentation in the FAIRiBOOK, including a requirement to install the s3browser.

TL;DR: the more I read, the less I see. When can we expect clear documentation on this?

@Schpidi
Member

Schpidi commented Jun 19, 2024

@KathiSchleidt it is always 4 options:

  1. Your workspace
  2. Shared folder in workspace
  3. UC bucket
  4. Shared bucket

The different capabilities specific to FiC are also mentioned: "... to use with Sentinel Hub." In general the difference is: "... used via the s3 protocol." from anywhere vs. normal file system only available in JupyterLab.
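
As a rough illustration of that difference, the same object can be addressed in two ways. The sketch below assumes that the root of the `~/s3` mount corresponds to the root of the UC bucket (not confirmed in this thread), and the bucket name and key are placeholders.

```python
from pathlib import PurePosixPath


def as_mounted_path(key, mount="~/s3"):
    """File-system view: only usable inside JupyterLab, where the mount exists."""
    return str(PurePosixPath(mount) / key.lstrip("/"))


def as_s3_uri(bucket, key):
    """Object-storage view: usable from anywhere that has the access keys."""
    return f"s3://{bucket}/{key.lstrip('/')}"


if __name__ == "__main__":
    key = "example/data.nc"  # placeholder object key
    print(as_mounted_path(key))
    print(as_s3_uri("my-uc-bucket", key))  # placeholder bucket name
```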

What are you missing?

@KathiSchleidt
Member

@Schpidi what I'm missing is a clean description of these various dimensions (objects vs. files, buckets vs filesystem) and options for providing and using this data. I admit I'm exceptionally confused due to my being less active in FAIRiCUBE the last months, but based on discussions with UC partners, seems I'm not the only one.

I have the impression that the applicability of APIs is also somehow related (still waiting on that answer, now close to 5 months waiting :( ), please clarify where we can apply APIs

When can we expect this to be clearly explained in RTD?

@BachirNILU

> @BachirNILU the big UC4 profile (Server Option) is now available

Thanks! It works.

@KathiSchleidt
Member

@Schpidi am I correct that there is no ambition to document this? I just checked the adding datasets section on RTD, nothing there.

@Schpidi
Member

Schpidi commented Sep 9, 2024

Added a first guide on storage to RTD for review, either at https://fairicube--8.org.readthedocs.build/en/8/guide/storage/ or in FAIRiCUBE/collaboration-platform#8. Happy to read your feedback.


7 participants