Update fix/analytics data export cleanup #4037
Conversation
📝 Walkthrough

The pull request introduces several modifications across multiple files.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (5)
src/analytics/api/models/events.py (2)
551-590: Simplify and optimize the `simple_data_cleaning` method

There appears to be redundancy in the data cleaning logic, specifically in handling NaN values for `pm2_5`. Refactor the method to eliminate redundant operations:

```diff
  data["pm2_5_raw_value"] = pd.to_numeric(
      data["pm2_5_raw_value"], errors="coerce"
  )
  data["pm2_5"] = pd.to_numeric(data["pm2_5"], errors="coerce")
- data.loc[~data["pm2_5_raw_value"].isna(), "pm2_5"] = np.nan
  if data["pm2_5_raw_value"].isna().all():
      data.drop(columns=["pm2_5_raw_value"], inplace=True)
+ else:
+     data.loc[~data["pm2_5_raw_value"].isna(), "pm2_5"] = np.nan
  data.dropna(how="all", axis=1, inplace=True)
```

This change ensures that `pm2_5` values are set to `NaN` only when `pm2_5_raw_value` has data, avoiding unnecessary reassignment.
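For reference, here is a self-contained sketch of the refactored logic as a standalone function — a hedged illustration only, since the actual method is defined as a classmethod on the events model:

```python
import numpy as np
import pandas as pd


def simple_data_cleaning(data: pd.DataFrame) -> pd.DataFrame:
    """Prefer raw PM2.5 readings where present and drop all-empty columns."""
    data["pm2_5_raw_value"] = pd.to_numeric(data["pm2_5_raw_value"], errors="coerce")
    data["pm2_5"] = pd.to_numeric(data["pm2_5"], errors="coerce")

    if data["pm2_5_raw_value"].isna().all():
        # No raw readings at all: keep calibrated pm2_5, drop the empty column.
        data.drop(columns=["pm2_5_raw_value"], inplace=True)
    else:
        # Raw readings exist: blank out pm2_5 wherever a raw value is present.
        data.loc[~data["pm2_5_raw_value"].isna(), "pm2_5"] = np.nan

    # Finally, drop any columns that ended up entirely empty.
    data.dropna(how="all", axis=1, inplace=True)
    return data
```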
551-590: Add unit tests for `simple_data_cleaning`

Given the importance of the new `simple_data_cleaning` method, it's essential to have unit tests covering various scenarios to ensure reliability. Consider adding unit tests that cover (see the sketch after this list):

- Cases where only `pm2_5` has data.
- Cases where only `pm2_5_raw_value` has data.
- Cases where both have data.
- Edge cases with missing or all `NaN` values.
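A minimal pytest sketch of those scenarios, written against the standalone version of the function shown earlier (in the real codebase the method lives on the events model, so the import and call would differ):

```python
import numpy as np
import pandas as pd

# Assumes the standalone simple_data_cleaning sketch above is in scope.


def test_raw_values_present_blanks_out_pm2_5():
    df = pd.DataFrame({"pm2_5": [1.0, 2.0], "pm2_5_raw_value": [10.0, 20.0]})
    out = simple_data_cleaning(df)
    # pm2_5 becomes all-NaN and is then dropped as an empty column.
    assert "pm2_5" not in out.columns
    assert out["pm2_5_raw_value"].notna().all()


def test_only_pm2_5_present_drops_empty_raw_column():
    df = pd.DataFrame({"pm2_5": [1.0, 2.0], "pm2_5_raw_value": [np.nan, np.nan]})
    out = simple_data_cleaning(df)
    assert "pm2_5_raw_value" not in out.columns
    assert out["pm2_5"].notna().all()


def test_mixed_values_keeps_pm2_5_only_where_raw_is_missing():
    df = pd.DataFrame({"pm2_5": [1.0, 2.0], "pm2_5_raw_value": [10.0, np.nan]})
    out = simple_data_cleaning(df)
    assert np.isnan(out.loc[0, "pm2_5"])
    assert out.loc[1, "pm2_5"] == 2.0
```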
src/analytics/api/utils/data_formatters.py (1)
331-354: Handle exceptions when fetching networks

In `validate_network`, the API call to `/users/networks` may fail. It's important to handle exceptions and log them appropriately. Add exception handling around the API request:

```python
try:
    response = airqo_requests.request(endpoint=endpoint, method="get")
except Exception as e:
    logger.exception(f"Error fetching networks: {e}")
    return False
```

src/analytics/api/views/data.py (2)
Line range hint 590-599: Use appropriate HTTP status code for "No data found"

Returning HTTP 200 OK with `success=False` for the "No data found" scenario is inconsistent with REST principles. Consider using HTTP 404 Not Found to properly indicate the absence of data.

```diff
- AirQoRequests.Status.HTTP_200_OK,
+ AirQoRequests.Status.HTTP_404_NOT_FOUND,
```
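A hedged sketch of what the empty-result branch might look like after this change — the payload shape is an assumption; only the status constant comes from the diff above:

```python
# Hypothetical branch inside the data export handler; payload shape is assumed.
if dataframe.empty:
    return (
        {"success": False, "message": "No data found"},
        AirQoRequests.Status.HTTP_404_NOT_FOUND,
    )
```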
Line range hint 558-564: Enhance parameter validation for grid

While validation has been added for new parameters, the `grid` parameter lacks type validation. Consider adding specific validation rules to ensure the grid parameter meets the expected format, for example by updating the request validation decorator:

```diff
 @validate_request_json(
     "startDateTime|required:datetime",
     "endDateTime|required:datetime",
     "airqloud|optional:str",
     "cohort|optional:str",
-    "grid|optional:str",
+    "grid|optional:str|pattern=^[A-Z0-9]+$",  # Adjust pattern based on your grid format
 )
```
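If the validation decorator does not support a `pattern` modifier (the `|pattern=...` syntax above is speculative), an explicit check inside the handler is a straightforward fallback — a sketch reusing the same placeholder pattern as the diff:

```python
import re

# Same placeholder pattern as suggested above; adjust to the real grid id format.
GRID_PATTERN = re.compile(r"^[A-Z0-9]+$")


def is_valid_grid(grid: str) -> bool:
    """Return True when the grid identifier matches the expected shape."""
    return bool(grid) and GRID_PATTERN.fullmatch(grid) is not None
```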
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- src/analytics/api/models/events.py (6 hunks)
- src/analytics/api/utils/data_formatters.py (1 hunks)
- src/analytics/api/utils/pollutants/pm_25.py (1 hunks)
- src/analytics/api/views/dashboard.py (1 hunks)
- src/analytics/api/views/data.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
src/analytics/api/models/events.py (1)
Learnt from: NicholasTurner23
PR: airqo-platform/AirQo-api#3819
File: src/analytics/api/models/events.py:47-48
Timestamp: 2024-11-12T09:15:21.461Z
Learning: Certain modules have a different version of `download_from_bigquery`, and updates to align their parameters will happen at a later time.
🔇 Additional comments (5)
src/analytics/api/models/events.py (1)
542-544: Ensure data cleaning is correctly applied for raw data

The addition of `cls.simple_data_cleaning(dataframe)` for raw data processing is appropriate. Verify that this cleaning step is necessary only for raw data and that it doesn't inadvertently affect other data types. Run the following script to confirm that data cleaning is only applied when `data_type` is `'raw'`:
✅ Verification successful
Let me analyze the implementation of the data cleaning method and other data type handling to ensure there's no unintended impact.
Let me check the implementation details of the data cleaning method and any related cleaning operations to ensure it's properly scoped.
Data cleaning is correctly scoped to raw data type
Based on the codebase analysis, the `simple_data_cleaning` method is exclusively called within the `if data_type == "raw":` condition, and there are no other instances where this cleaning is applied to other data types. The implementation follows a clear and intentional pattern where raw data requires specific cleaning operations.
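Per the verification output, the call site is scoped roughly like this (a paraphrase of the matched lines, not the exact code):

```python
# Cleaning only runs for the raw data type; other types pass through untouched.
if data_type == "raw":
    dataframe = cls.simple_data_cleaning(dataframe)
```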
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Description: Check if 'simple_data_cleaning' is called only for raw data.
rg --no-heading --line-number "if data_type == \"raw\":" src/analytics/api/models/events.py -A 2
```

Length of output: 383

Script:

```bash
#!/bin/bash
# Check for any other data cleaning methods or data type handling
rg --no-heading --line-number "data_type" src/analytics/api/models/events.py -B 2 -A 2

# Look for simple_data_cleaning implementation
ast-grep --pattern 'def simple_data_cleaning($$$)'
```

Length of output: 1957

Script:

```bash
#!/bin/bash
# Search for the implementation of simple_data_cleaning method
rg --no-heading --line-number "simple_data_cleaning" src/analytics/api/models/events.py -B 3 -A 5

# Check for any other data cleaning methods in the file
rg --no-heading --line-number "clean" src/analytics/api/models/events.py
```

Length of output: 1186
src/analytics/api/utils/pollutants/pm_25.py (1)
Line range hint 68-72: Confirm removal of the 'raw' key in `BIGQUERY_FREQUENCY_MAPPER`

The 'raw' frequency key has been removed from `BIGQUERY_FREQUENCY_MAPPER`. Verify that this does not affect any functions that rely on raw frequency mappings, and check whether any code still references the 'raw' key in `BIGQUERY_FREQUENCY_MAPPER`.
src/analytics/api/views/dashboard.py (2)
227-230: Ensure tenant consistency in data retrieval

In the `D3ChartDataResource` class, the tenant is hardcoded as `"airqo"`. Verify that this is intentional and won't affect multi-tenant support. If the application supports multiple tenants, consider retrieving the tenant dynamically from the request or context, as sketched below.
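One hedged option, assuming the standard Flask `request` object is available in the resource, would be to read the tenant from the query string and fall back to the current default:

```python
from flask import request

# Hypothetical: resolve the tenant per request instead of hardcoding it.
tenant = request.args.get("tenant", "airqo")
```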
227-230: Update method call after removing `tenant` parameter

The call to `get_d3_chart_events_v2` no longer includes the `tenant` parameter, aligning with the updated method signature.
src/analytics/api/views/data.py (1)
169-169: Reconsider dropping the timestamp column

Dropping the timestamp column when `minimum_output` is True could be problematic, as temporal information is crucial for time-series air quality data. This might affect:
- Data traceability and analysis
- Client applications expecting timestamp information
- Time-based visualizations and charts
Let's verify if any client code depends on the timestamp field:
✅ Verification successful
The previous search revealed frontend usage of timestamps, but we need to verify the analytics service's API contract and dependencies. Let's check the Python codebase and any API documentation within the analytics service to understand if removing the timestamp field could break existing functionality.
Dropping timestamp is safe with the `minimum_output` flag

Based on the codebase analysis, dropping the timestamp column when `minimum_output` is True is actually a valid optimization:

- The timestamp field is only dropped when explicitly requested via `minimum_output=True`, indicating a conscious choice by the API consumer to receive minimal data
- The codebase shows that timestamp is primarily used for:
  - Internal data processing (sorting, grouping)
  - Converting to local timezones
  - Generating time-based visualizations
- These operations happen before the final output is returned
The change aligns with the API's purpose of providing a minimal data option while preserving the timestamp for all other use cases.
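For reference, the guarded drop under discussion likely reduces to something like the following sketch (the exact variable names in data.py may differ):

```python
# Drop the timestamp column only when the caller explicitly asked for minimal output.
if minimum_output:
    dataframe.drop(columns=["timestamp"], inplace=True, errors="ignore")
```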
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Search for timestamp usage in frontend code
rg -g '*.{js,jsx,ts,tsx}' 'timestamp'

# Search for timestamp in API documentation
rg -g '*.yml' 'timestamp' api/docs/
```

Length of output: 16186

Script:

```bash
#!/bin/bash
# Search for timestamp usage in Python files related to analytics/data
rg -g '*.py' -B 3 -A 3 'timestamp' src/analytics/

# Search for API documentation or schema definitions
fd -e yaml -e yml . src/analytics/ -x cat {}
```

Length of output: 55464
```diff
-    def get_d3_chart_events_v2(
-        self, sites, start_date, end_date, pollutant, frequency, tenant
-    ):
+    def get_d3_chart_events_v2(self, sites, start_date, end_date, pollutant, frequency):
```
💡 Codebase verification
Update required: Inconsistent method signatures across codebase
The verification reveals that there are inconsistent implementations and usages of `get_d3_chart_events_v2`:

- `src/insights/models/events.py:1010`: Method defined with `tenant` parameter
- `src/insights/namespaces/dashboard.py:147`: Called with `tenant` parameter
- `src/analytics/api/models/events.py:1322`: Method defined without `tenant` parameter
- `src/analytics/api/views/dashboard.py:229`: Called without `tenant` parameter
This indicates a potential architectural issue where two different implementations of the same method exist with different signatures, which could lead to runtime errors.
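One low-risk way to reconcile the two signatures while the insights module catches up would be to keep `tenant` as an optional, unused parameter — a sketch, not code from this PR:

```python
from typing import Optional


def get_d3_chart_events_v2(
    self, sites, start_date, end_date, pollutant, frequency,
    tenant: Optional[str] = None,  # accepted for backward compatibility; intentionally unused
):
    ...
```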
🔗 Analysis chain
Verify all usages of `get_d3_chart_events_v2` are updated

The method `get_d3_chart_events_v2` no longer includes the `tenant` parameter. Ensure that all calls to this method have been updated to prevent potential `TypeError` exceptions.
Run the following script to locate all usages:
Review each occurrence to confirm that the `tenant` parameter is no longer being passed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Description: Find all calls to 'get_d3_chart_events_v2' and check parameters.
rg --no-heading --line-number 'get_d3_chart_events_v2\(' src/ -A 1
```

Length of output: 888
```diff
@@ -413,6 +413,7 @@
     frequency (str): Data frequency (e.g., 'raw', 'daily', 'hourly').
     pollutants (list): List of pollutants to include in the data.
     data_type (str): Type of data ('raw' or 'aggregated').
+    filter_columns(list)
```
Fix the docstring formatting for the `filter_columns` parameter

The new parameter `filter_columns(list)` in the docstring is not properly formatted. It should include a colon and a descriptive explanation to match the existing docstring style. Apply this diff to correct the docstring:
```diff
     filter_value (list): Filter values corresponding to the filter type.
     start_date (str): Start date for the data query.
     end_date (str): End date for the data query.
     frequency (str): Data frequency (e.g., 'raw', 'daily', 'hourly').
     pollutants (list): List of pollutants to include in the data.
     data_type (str): Type of data ('raw' or 'aggregated').
+    filter_columns (list): List of columns to filter.
     weather_fields (list): List of weather fields to retrieve.
```
Committable suggestion skipped: line range outside the PR's diff.
```python
def validate_network(self, network_name: str) -> bool:
    """
    Validate if a given network name exists in the list of networks.

    Args:
        network_name (str): The name of the network to validate.

    Returns:
        bool: True if the network name exists, False otherwise.
    """
    if not network_name:
        return False

    endpoint: str = "/users/networks"
    airqo_requests = AirQoRequests()
    response = airqo_requests.request(endpoint=endpoint, method="get")

    if response and "networks" in response:
        networks = response["networks"]
        # TODO Could add an active network filter
        return any(network.get("net_name") == network_name for network in networks)

    return False
```
Remove unnecessary `self` parameter from the `validate_network` function

The `validate_network` function is defined with a `self` parameter but is not inside a class, causing a `TypeError` when called. Modify the function definition to remove the `self` parameter:

```diff
-def validate_network(self, network_name: str) -> bool:
+def validate_network(network_name: str) -> bool:
```

Also, ensure that all calls to this function are updated accordingly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def validate_network(network_name: str) -> bool:
    """
    Validate if a given network name exists in the list of networks.

    Args:
        network_name (str): The name of the network to validate.

    Returns:
        bool: True if the network name exists, False otherwise.
    """
    if not network_name:
        return False

    endpoint: str = "/users/networks"
    airqo_requests = AirQoRequests()
    response = airqo_requests.request(endpoint=endpoint, method="get")

    if response and "networks" in response:
        networks = response["networks"]
        # TODO Could add an active network filter
        return any(network.get("net_name") == network_name for network in networks)

    return False
```
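Typical call-site usage after the fix might look like this (the surrounding handler and error handling are hypothetical):

```python
network = "airqo"
if not validate_network(network):
    # Hypothetical error handling; the real handler may return an HTTP response instead.
    raise ValueError(f"Unknown network: {network}")
```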
Clean up
Description
[Provide a brief description of the changes made in this PR]
Summary by CodeRabbit
- New Features
- Bug Fixes
- Documentation
- Refactor
- Chores