Ingestion of common/shared data #61

sonjastndl · 2024-06-05T12:41:16Z

Hi everyone,

I am curious about the modalities of the shared s3 bucket on EOX Hub.

When I access data at the EOX Hub (for example, requesting a .nc file on Surface Map for the year 2019 via the CDS API), I currently store it in our UC-specific bucket. I can imagine the same happening for other Use Cases and datasets, and I am wondering whether we are accumulating a lot of redundant data in the UC-specific buckets.

Are there any recommendations from the EOX side on how to handle this, @Schpidi @eox-cs1? Just copying files to the common s3 bucket might cause some chaos due to naming and other reasons, and it would not avoid redundancy either.
I assume we will talk about this in next week's meeting. Still, it would be good to know whether there are already some foreseen "best practices" and/or restrictions that need to be considered.
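
As a rough way to check how much redundancy already exists, the sketch below groups byte-identical files found under several folders. It assumes the storage in question is reachable as ordinary directories (for example through a mount); the function names are made up for illustration, and hashing everything on a large bucket mount can be slow.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots):
    """Group files with identical content found under the given folders.

    Returns a list of groups; each group is a list of paths whose
    content is byte-identical. Overlapping roots would list the same
    file twice, so pass distinct folders.
    """
    by_digest = defaultdict(list)
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates([folder_a, folder_b])` with two placeholder folders returns one list per set of identical files, which could serve as a starting point for deciding what to move to a common location.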

@KathiSchleidt @Susannaioni

KathiSchleidt · 2024-06-17T11:38:45Z

@eox-cs1 @Schpidi we really need feedback on how to structure the S3 buckets. How can UC partners:

* make commonly accessible data available to all UC
* keep sensitive data constrained to their UC
* access both common and sensitive data via which server?

Related to the issue on server configuration, #64

eox-cs1 · 2024-06-17T12:27:30Z

* make commonly accessible data available to all UC

All UCs have access to a folder located at /shared/fairicube or /home//.shared/fairicube
Within this folder (and its subfolders), all FAIRiCUBE UCs can store data that is accessible to all UCs.

* keep sensitive data constrained to their UC

Every UC has its own associated s3 bucket, which is accessible only by that UC.

* access both common and sensitive data via which server?

all UCs have access to a folder located at /shared/fairicube or /home//.shared/fairicube

The respective kernel (to be selected at the beginning of a session) determines which software tools are available (as requested by each UC), but has no influence on the availability of the shared/private folders (see above).
(see also #64)

KathiSchleidt · 2024-06-17T12:32:15Z

@eox-cs1 where is this documented?

BachirNILU · 2024-06-17T12:53:33Z

Hi @eox-cs1,

Thank you for your responses!
I have a very specific question.
I have uploaded data to the s3 bucket under the UC4 server option.
So I understand this bucket is specific to UC4 users.
The resources (RAM) were not sufficient to run my code, so I plan to use another server option (UC1) that has 60 GB of RAM, which would be sufficient.
To avoid duplicating the data, I want to run the code after connecting to the UC1 server option and access the data in UC4's s3 bucket. How can I do that?

Thanks in advance,

Best regards,

-Bachir.

eox-cs1 · 2024-06-17T13:19:08Z

Direct cross-UC access is not foreseen, as it would contradict the separation between UCs.

However, you can either use the common /shared/fairicube folder for data exchange/access
OR
you can take the respective secrets from the desired storage and manually inject them into your environment (in addition to the use-case specific ones). This should also give you access to the UC4 s3 bucket from UC1
OR
you can ask for a larger machine for UC4 (if you will need this more frequently).
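
For the second option, injecting extra secrets into a running Python session could look roughly like the sketch below. It is only an illustration: the helper `injected_env` and the variable names in the usage note are hypothetical, and the real values have to be copied from a session of the other use case. Using distinct names keeps the session's own credentials untouched.

```python
import os
from contextlib import contextmanager


@contextmanager
def injected_env(extra):
    """Temporarily add variables to os.environ.

    `extra` maps variable names to string values. On exit, every
    touched variable is restored to its previous value or removed
    if it did not exist before.
    """
    previous = {name: os.environ.get(name) for name in extra}
    os.environ.update(extra)
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old
```

A call such as `with injected_env({"OTHER_UC_S3_KEY": "...", "OTHER_UC_S3_SECRET": "..."}):` (placeholder names and values) makes the additional credentials visible to code inside the block only, so they do not linger in the session afterwards.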

sonjastndl · 2024-06-17T13:42:10Z

@mari-s4e
Hey Maria,
this somewhat contradicts where we have been exchanging data, right?
So for accessing the shared folder none of the actions documented in the FAIRiCUBE Notebook are necessary?

Because data there is not located under shared...
This is btw also the issue here.
Can someone explain the difference?

BachirNILU · 2024-06-17T13:57:08Z

Thanks @eox-cs1!
I think direct cross-UC access would make sense on a per-user basis (with limited access).
For instance, I am involved in both UC1 and UC4. A cross-UC access to "only" UC1 and UC4 would make sense.
This can be adapted to each user.
For the suggested options, how can I use option 2 (access UC4 bucket from UC1)? What commands should I run?

Thanks in advance.

Schpidi · 2024-06-18T07:53:17Z

@BachirNILU I believe in your case the simplest solution would be to add another profile option to UC4. Note that this in itself does not incur costs; costs arise only when you run a session.

If you want to use data from a use-case-specific bucket anywhere else, you can retrieve the required details, such as access keys, from the environment variables, for example with a command like printenv | grep S3_USER.
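
The same lookup can be done from inside a notebook. The sketch below is a Python counterpart to the command above, with two hypothetical helpers: `find_variables` filters by variable name (the shell command also matches values), and `mask` hides most of each value so secrets are not printed in full.

```python
import os
from typing import Mapping


def find_variables(pattern: str, env: Mapping[str, str] = os.environ) -> dict:
    """Return the variables whose name contains `pattern`, sorted by name."""
    return {name: env[name] for name in sorted(env) if pattern in name}


def mask(value: str, visible: int = 2) -> str:
    """Hide a value, keeping the first `visible` characters of longer ones."""
    shown = value[:visible] if len(value) > 2 * visible else ""
    return shown + "*" * (len(value) - len(shown))


if __name__ == "__main__":
    # Prints nothing if no matching variable is set in this session.
    for name, value in find_variables("S3_USER").items():
        print(f"{name}={mask(value)}")
```

Which variables actually exist depends on the session; the pattern S3_USER is the only name taken from the comment above.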

BachirNILU · 2024-06-18T08:04:29Z

Thank you @Schpidi for your response!
This is great! If the additional UC4 profile option does not entail additional costs (unless used of course), that works perfectly for us (given that we do not use such memory frequently). 60 GB of RAM will be great.
Thanks again!

Schpidi · 2024-06-18T08:05:14Z

@sonjastndl sorry, there is a little misunderstanding that I might have caused in the last call.

We offer two types of storage, "File Storage" and "Object Storage", which have slightly different capabilities.

Per default in a JupyterLab session you get your personal workspace as well as a shared folder (/shared/fairicube/ or ~/.shared/fairicube/) which are both persisted on File Storage.

In addition we provide Object Storage to each Use Case separately for example to use with Sentinel Hub. This Object Storage is for convenience mounted to ~/s3 for each user but preferably used via the s3 protocol.

On top of this we were asked to provide shared Object Storage accessible to all Use Cases. This is the fairicube bucket where access keys are shared for usage. This Object Storage is not automatically mounted in the JupyterLab session.
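
To see which of the locations mentioned above are actually present in a given session, a small check like the following could help. It is a sketch only: the labels and the helper name are made up, the paths are the ones named in this comment, and the shared bucket is left out because it is not mounted automatically.

```python
from pathlib import Path

# Paths as named above; "~" is expanded to the user's home directory.
STORAGE_LOCATIONS = {
    "shared folder (File Storage)": "/shared/fairicube",
    "shared folder via home (File Storage)": "~/.shared/fairicube",
    "UC bucket mount (Object Storage)": "~/s3",
}


def available_locations(locations=STORAGE_LOCATIONS):
    """Map each label to True if its path exists as a directory."""
    return {
        label: Path(path).expanduser().is_dir()
        for label, path in locations.items()
    }


if __name__ == "__main__":
    for label, present in available_locations().items():
        print(f"{label}: {'found' if present else 'not found'}")
```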

I hope this clarifies your questions.

Schpidi · 2024-06-18T08:59:21Z

@BachirNILU we'll roll out the required configuration tomorrow and enable a large profile for UC4.

Schpidi · 2024-06-19T07:18:17Z

@BachirNILU the big UC4 profile (Server Option) is now available

Schpidi · 2024-06-19T07:30:56Z

With this I believe we can come back to the original question 😉

From a technical point of view the questions to answer are:

Do I need the data only locally, i.e., in JupyterLab? --> Use your local workspace
Do I want to share the data with all users of FAIRiCUBE locally? --> Use the shared folder
Do I want to share the data with all users in my Use Case or do I need external access like via Sentinel Hub services? --> Use the UC bucket
Do I want to share the data with all users of FAIRiCUBE and need external access? --> Use the shared bucket
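
The four questions above can be read as a small decision rule. The sketch below encodes them; the function name and the argument values ("me", "use_case", "all") are invented for illustration and are not part of any platform API.

```python
def choose_storage(audience: str, external_access: bool) -> str:
    """Pick a storage location following the four questions above.

    audience: "me" (only my session), "use_case" (my Use Case team)
              or "all" (all FAIRiCUBE users).
    external_access: True if the data must be reachable from outside
              JupyterLab, e.g. via Sentinel Hub services.
    """
    if audience not in ("me", "use_case", "all"):
        raise ValueError(f"unknown audience: {audience!r}")
    if audience == "all":
        return "shared bucket" if external_access else "shared folder"
    if audience == "use_case" or external_access:
        return "UC bucket"
    return "local workspace"
```

For example, `choose_storage("all", external_access=False)` yields "shared folder", while any need for external access moves the data to one of the two buckets.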

How data is organized on the bucket doesn't matter from a technical point of view but, I agree, it should be agreed on within a UC team or among all FAIRiCUBE users, and documented.

KathiSchleidt · 2024-06-19T10:15:07Z

@Schpidi many thanks for the clarification, but I fear now I understand exactly nothing :(

I read about "File Storage" and "Object Storage" which have slightly different capabilities, but find no indication of what these slight differences are.

I then find either 3 or 4 options of where to put data:

Per default in a JupyterLab session you get your personal workspace as well as a shared folder (/shared/fairicube/ or ~/.shared/fairicube/) which are both persisted on File Storage.
In addition we provide Object Storage to each Use Case separately for example to use with Sentinel Hub. This Object Storage is for convenience mounted to ~/s3 for each user but preferably used via the s3 protocol.
On top of this we were asked to provide shared Object Storage accessible to all Use Cases. This is the fairicube bucket where access keys are shared for usage. This Object Storage is not automatically mounted in the JupyterLab session.

vs.

Do I need the data only locally, i.e., in JupyterLab? --> Use your local workspace
Do I want to share the data with all users of FAIRiCUBE locally? --> Use the shared folder
Do I want to share the data with all users in my Use Case or do I need external access like via Sentinel Hub services? --> Use the UC bucket
Do I want to share the data with all users of FAIRiCUBE and need external access? --> Use the shared bucket

On the common bucket I've found some documentation in the FAIRiBOOK, which mentions a requirement to install the s3browser.

TL;DR: the more I read, the less I see. When can we expect clear documentation on this?

Schpidi · 2024-06-19T10:34:26Z

@KathiSchleidt it is always 4 options:

Your workspace
Shared folder in workspace
UC bucket
Shared bucket

The different capabilities specific to FiC are also mentioned: "... to use with Sentinel Hub." In general the difference is: "... used via the s3 protocol." from anywhere vs. normal file system only available in JupyterLab.

What are you missing?

KathiSchleidt · 2024-06-19T10:40:39Z

@Schpidi what I'm missing is a clean description of these various dimensions (objects vs. files, buckets vs. filesystem) and of the options for providing and using this data. I admit I'm exceptionally confused due to my being less active in FAIRiCUBE over the last months, but based on discussions with UC partners, it seems I'm not the only one.

I have the impression that the applicability of APIs is also somehow related (still waiting on that answer, now close to 5 months waiting :( ), please clarify where we can apply APIs

When can we expect this to be clearly explained in RTD?

BachirNILU · 2024-06-20T10:04:17Z

@BachirNILU the big UC4 profile (Server Option) is now available

Thanks! It works.

KathiSchleidt · 2024-09-03T11:50:26Z

@Schpidi am I correct that there is no ambition to document this? I just checked the adding datasets section on RTD, nothing there.

Schpidi · 2024-09-09T20:53:01Z

Added a first guide on storage to RTD for review, either at https://fairicube--8.org.readthedocs.build/en/8/guide/storage/ or in FAIRiCUBE/collaboration-platform#8. Happy to read your feedback.

KathiSchleidt assigned Schpidi, eox-cs1, mari-s4e and Susannaioni Jun 5, 2024

KathiSchleidt assigned BachirNILU Jun 17, 2024

eox-cs1 mentioned this issue Jun 17, 2024

Server Configurations vs. UC vs. Buckets #64

Open

KathiSchleidt mentioned this issue Sep 23, 2024

Sharing data across FAIRiCUBE UCs on EOXHub #80

Closed
