Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor process parquet #124

Open
wants to merge 6 commits into
base: croptype
Choose a base branch
from
Open

Conversation

cbutsko
Copy link

@cbutsko cbutsko commented Dec 11, 2024

  • now process_parquet can also work with dekadal inputs
  • introduced the minimum list of required columns - ["sample_id", "timestamp", "lat", "lon"]
  • other index columns used for pivoting are formed dynamically depending on what is available in the dataframe
  • checks done for valid_time variable are now optional (e.g., checking if valid_time is outside of available observations range or too close to the edge)

@cbutsko cbutsko requested a review from kvantricht December 11, 2024 13:48
@cbutsko cbutsko changed the base branch from main to croptype December 11, 2024 13:48
@cbutsko cbutsko mentioned this pull request Dec 12, 2024
3 tasks
Copy link
Contributor

@kvantricht kvantricht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, after the desk discussion yesterday.

- reinitializing the start_date, end_date and timestamp_ind to take into account
newly added timesteps
- initializing the start_date and end_date as the first and last available observation;
- computing relative position of the timestamp (timestamp_ind variable) in the timeseries;
- checking for missing timesteps in the middle of the timeseries and adding them
with NODATA values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"filling them with NODATAVALUE"

takes into account updated start_date and end_date; available_timesteps
holds the absolute number of timesteps that for which observations are
- computing the number of available timesteps in the timeseries;
it represents the absolute number of timesteps for which observations are
available; it cannot be less than NUM_TIMESTEPS; if this is the case,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this now more flexible for number of timesteps if the latter is still imported from presto-worldcereal. Doesn't seem to be a flexible parameter?

["sample_id", "timestamp"].
use_valid_time (bool): If True, the function will use the valid_time column to check
if valid_time lies within the range of available observations,
with MIN_EDGE_BUFFER buffer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make this a kwarg of the method?
min_edge_buffer: int = MIN_EDGE_BUFFER?

feature_columns = bands10m + bands20m + bands100m
# for index columns we need to include all columns that are not feature columns
index_columns = [col for col in df.columns if col not in feature_columns]
# and also ensure that static DEM columns and lat-lon are included.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a good idea to have the lat/lon features optional here? It means we can silently feed them as no data while in training this has never been the case. I would propose to keep them required. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfr desk discussion

Dataset {df["ref_id"].iloc[0]}: removing {len(samples_after_end_date)} \
samples with valid_date after the end_date \
and {len(samples_before_start_date)} samples with valid_date before the start_date"""
static_features = ["DEM-alt-20m", "DEM-slo-20m", "lat", "lon"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfr desk discussion

(df["sample_id"].isin(samples_to_add_ts_after_end)) & (df["is_last_available_ts"])
].copy()
dummy_df["timestamp"] = dummy_df["timestamp"] + pd.DateOffset(
months=n_ts_to_add
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question as above.

)
index_columns.append("available_timesteps")

# check for missing timestamps in the middle of timeseries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment belong to the actual code on the next line (being the concatenation of dataframes)?

@@ -314,6 +301,7 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
df = pd.concat([df, dummy_df])

# finally pivot the dataframe
index_columns = list(np.unique(index_columns))
df_pivot = df.pivot(index=index_columns, columns="timestamp_ind", values=feature_columns)
df_pivot = df_pivot.fillna(NODATAVALUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the actual filling of timestamps "in the middle" ?

@@ -366,11 +349,8 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
)
df_pivot = df_pivot[~samples_with_too_few_ts]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can move inside the if for clarity

@@ -716,7 +696,7 @@ def prep_dataframe(
# SAR cannot equal 0.0 since we take the log of it
cols = [f"SAR-{s}-ts{t}-20m" for s in ["VV", "VH"] for t in range(36 if dekadal else 12)]

df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "end_date"])
df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "start_date"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this happen?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants