Refactor process parquet #124
base: croptype
Conversation
Added some comments after the desk discussion yesterday.
- reinitializing the start_date, end_date and timestamp_ind to take into account
  newly added timesteps
- initializing the start_date and end_date as the first and last available observation;
- computing relative position of the timestamp (timestamp_ind variable) in the timeseries;
- checking for missing timesteps in the middle of the timeseries and adding them
  with NODATA values
"filling them with NODATAVALUE"
takes into account updated start_date and end_date; available_timesteps
holds the absolute number of timesteps for which observations are
- computing the number of available timesteps in the timeseries;
  it represents the absolute number of timesteps for which observations are
  available; it cannot be less than NUM_TIMESTEPS; if this is the case,
How is this now more flexible for the number of timesteps if the latter is still imported from presto-worldcereal? It doesn't seem to be a flexible parameter.
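If flexibility is the goal, the constant could become a parameter defaulting to the imported value, so callers can override it (a sketch; the import path is an assumption, not taken from the repo):

```python
from presto.utils import NUM_TIMESTEPS  # assumed import path


def process_parquet(
    df: pd.DataFrame,
    num_timesteps: int = NUM_TIMESTEPS,  # overridable per call instead of fixed at import
) -> pd.DataFrame:
    ...
```

A caller could then pass e.g. `num_timesteps=36` for dekadal data without touching the presto-worldcereal dependency.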
["sample_id", "timestamp"]. | ||
use_valid_time (bool): If True, the function will use the valid_time column to check | ||
if valid_time lies within the range of available observations, | ||
with MIN_EDGE_BUFFER buffer. |
should we make this a kwarg of the method, e.g. `min_edge_buffer: int = MIN_EDGE_BUFFER`?
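A sketch of that suggestion, same pattern as the `num_timesteps` one above (signature reconstructed, not the actual one):

```python
def process_parquet(
    df: pd.DataFrame,
    use_valid_time: bool = False,
    min_edge_buffer: int = MIN_EDGE_BUFFER,  # default keeps current behaviour
) -> pd.DataFrame:
    ...
```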
feature_columns = bands10m + bands20m + bands100m
# for index columns we need to include all columns that are not feature columns
index_columns = [col for col in df.columns if col not in feature_columns]
# and also ensure that static DEM columns and lat-lon are included.
Is it a good idea to have the lat/lon features optional here? It means we can silently feed them as nodata, while in training this has never been the case. I would propose to keep them required. What do you think?
cfr desk discussion
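If they stay required, a cheap guard would fail fast instead of silently feeding nodata (a sketch; the column names are taken from the `static_features` list in the next snippet):

```python
REQUIRED_STATIC = ["DEM-alt-20m", "DEM-slo-20m", "lat", "lon"]

missing = [col for col in REQUIRED_STATIC if col not in df.columns]
if missing:
    raise ValueError(f"Required static columns missing from input: {missing}")
```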
Dataset {df["ref_id"].iloc[0]}: removing {len(samples_after_end_date)} \
samples with valid_date after the end_date \
and {len(samples_before_start_date)} samples with valid_date before the start_date"""
static_features = ["DEM-alt-20m", "DEM-slo-20m", "lat", "lon"]
cfr desk discussion
(df["sample_id"].isin(samples_to_add_ts_after_end)) & (df["is_last_available_ts"]) | ||
].copy() | ||
dummy_df["timestamp"] = dummy_df["timestamp"] + pd.DateOffset( | ||
months=n_ts_to_add |
same question as above.
)
index_columns.append("available_timesteps")

# check for missing timestamps in the middle of timeseries
Does this comment belong to the actual code on the next line (the concatenation of dataframes)?
@@ -314,6 +301,7 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
    df = pd.concat([df, dummy_df])

    # finally pivot the dataframe
    index_columns = list(np.unique(index_columns))
    df_pivot = df.pivot(index=index_columns, columns="timestamp_ind", values=feature_columns)
    df_pivot = df_pivot.fillna(NODATAVALUE)
this is the actual filling of timestamps "in the middle"?
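It looks like it: after the pivot, any (sample, timestamp_ind) cell without an observation is NaN, and `fillna(NODATAVALUE)` turns those into nodata. A toy example (band name and values invented):

```python
import pandas as pd

NODATAVALUE = 65535  # assumed sentinel

toy = pd.DataFrame({
    "sample_id": ["a", "a", "a", "b", "b", "b", "b"],
    "timestamp_ind": [0, 2, 3, 0, 1, 2, 3],  # sample "a" misses index 1
    "NDVI": [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
pivot = toy.pivot(index="sample_id", columns="timestamp_ind", values="NDVI")
print(pivot.fillna(NODATAVALUE))
# timestamp_ind    0        1    2    3
# a              0.1  65535.0  0.3  0.4
# b              0.5      0.6  0.7  0.8
```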
@@ -366,11 +349,8 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
    )
    df_pivot = df_pivot[~samples_with_too_few_ts]
this can move inside the `if` for clarity
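i.e. something like this (reconstructed; the actual condition and logging in the PR may differ, and `logger` is hypothetical):

```python
if samples_with_too_few_ts.any():
    logger.warning(
        f"Dropping {samples_with_too_few_ts.sum()} samples with too few timesteps"
    )
    df_pivot = df_pivot[~samples_with_too_few_ts]
```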
@@ -716,7 +696,7 @@ def prep_dataframe(
    # SAR cannot equal 0.0 since we take the log of it
    cols = [f"SAR-{s}-ts{t}-20m" for s in ["VV", "VH"] for t in range(36 if dekadal else 12)]

-   df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "end_date"])
+   df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "start_date"])
why does this happen?
["sample_id", "timestamp", "lat", "lon"]