Refactor process parquet #124
base: croptype
Conversation
Added some comments after the desk discussion yesterday.
- reinitializing the start_date, end_date and timestamp_ind to take into account
  newly added timesteps
- initializing the start_date and end_date as the first and last available observation;
- computing relative position of the timestamp (timestamp_ind variable) in the timeseries;
- checking for missing timesteps in the middle of the timeseries and adding them
  with NODATA values
"filling them with NODATAVALUE"
takes into account updated start_date and end_date; available_timesteps
holds the absolute number of timesteps for which observations are
- computing the number of available timesteps in the timeseries;
  it represents the absolute number of timesteps for which observations are
  available; it cannot be less than NUM_TIMESTEPS; if this is the case,
How is this now more flexible for the number of timesteps if the latter is still imported from presto-worldcereal? It doesn't seem to be a flexible parameter.
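If flexibility is the goal, the constant could become a parameter defaulting to the imported value, so callers can override it (a sketch; the import path is an assumption, not taken from the repo):

```python
from presto.utils import NUM_TIMESTEPS  # assumed import path


def process_parquet(
    df: pd.DataFrame,
    num_timesteps: int = NUM_TIMESTEPS,  # overridable per call instead of fixed at import
) -> pd.DataFrame:
    ...
```

A caller could then pass e.g. `num_timesteps=36` for dekadal data without touching the presto-worldcereal dependency.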
["sample_id", "timestamp"]. | ||
use_valid_time (bool): If True, the function will use the valid_time column to check | ||
if valid_time lies within the range of available observations, | ||
with MIN_EDGE_BUFFER buffer. |
should we make this a kwarg of the method, e.g. `min_edge_buffer: int = MIN_EDGE_BUFFER`?
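A sketch of that suggestion, same pattern as the `num_timesteps` one above (signature reconstructed, not the actual one):

```python
def process_parquet(
    df: pd.DataFrame,
    use_valid_time: bool = False,
    min_edge_buffer: int = MIN_EDGE_BUFFER,  # default keeps current behaviour
) -> pd.DataFrame:
    ...
```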
feature_columns = bands10m + bands20m + bands100m
# for index columns we need to include all columns that are not feature columns
index_columns = [col for col in df.columns if col not in feature_columns]
# and also ensure that static DEM columns and lat-lon are included.
Is it a good idea to have the lat/lon features optional here? It means we can silently feed them as nodata, while in training this has never been the case. I would propose to keep them required. What do you think?
cfr desk discussion
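If they stay required, a cheap guard would fail fast instead of silently feeding nodata (a sketch; the column names are taken from the `static_features` list in the next snippet):

```python
REQUIRED_STATIC = ["DEM-alt-20m", "DEM-slo-20m", "lat", "lon"]

missing = [col for col in REQUIRED_STATIC if col not in df.columns]
if missing:
    raise ValueError(f"Required static columns missing from input: {missing}")
```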
Dataset {df["ref_id"].iloc[0]}: removing {len(samples_after_end_date)} \
samples with valid_date after the end_date \
and {len(samples_before_start_date)} samples with valid_date before the start_date"""
static_features = ["DEM-alt-20m", "DEM-slo-20m", "lat", "lon"]
cfr desk discussion
(df["sample_id"].isin(samples_to_add_ts_after_end)) & (df["is_last_available_ts"]) | ||
].copy() | ||
dummy_df["timestamp"] = dummy_df["timestamp"] + pd.DateOffset( | ||
months=n_ts_to_add |
same question as above.
)
index_columns.append("available_timesteps")

# check for missing timestamps in the middle of timeseries
Does this comment belong to the actual code on the next line (the concatenation of dataframes)?
@@ -314,6 +301,7 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
    df = pd.concat([df, dummy_df])

    # finally pivot the dataframe
    index_columns = list(np.unique(index_columns))
    df_pivot = df.pivot(index=index_columns, columns="timestamp_ind", values=feature_columns)
    df_pivot = df_pivot.fillna(NODATAVALUE)
this is the actual filling of timestamps "in the middle"?
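It looks like it: after the pivot, any (sample, timestamp_ind) cell without an observation is NaN, and `fillna(NODATAVALUE)` turns those into nodata. A toy example (band name and values invented):

```python
import pandas as pd

NODATAVALUE = 65535  # assumed sentinel

toy = pd.DataFrame({
    "sample_id": ["a", "a", "a", "b", "b", "b", "b"],
    "timestamp_ind": [0, 2, 3, 0, 1, 2, 3],  # sample "a" misses index 1
    "NDVI": [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
pivot = toy.pivot(index="sample_id", columns="timestamp_ind", values="NDVI")
print(pivot.fillna(NODATAVALUE))
# timestamp_ind    0        1    2    3
# a              0.1  65535.0  0.3  0.4
# b              0.5      0.6  0.7  0.8
```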
@@ -366,11 +349,8 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
    )
    df_pivot = df_pivot[~samples_with_too_few_ts]
this can move inside the `if` for clarity
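i.e. something like this (reconstructed; the actual condition and logging in the PR may differ, and `logger` is hypothetical):

```python
if samples_with_too_few_ts.any():
    logger.warning(
        f"Dropping {samples_with_too_few_ts.sum()} samples with too few timesteps"
    )
    df_pivot = df_pivot[~samples_with_too_few_ts]
```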
@@ -716,7 +696,7 @@ def prep_dataframe(
    # SAR cannot equal 0.0 since we take the log of it
    cols = [f"SAR-{s}-ts{t}-20m" for s in ["VV", "VH"] for t in range(36 if dekadal else 12)]

-   df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "end_date"])
+   df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "start_date"])
why does this happen?
["sample_id", "timestamp", "lat", "lon"]