
Update fix/analytics data export cleanup #4037

Conversation

@NicholasTurner23 NicholasTurner23 commented Dec 10, 2024

Description

[Provide a brief description of the changes made in this PR]

  • Fixes an issue where charts do not return any data.
  • Adds a provision for non-AirQo data to be downloaded.

Summary by CodeRabbit

  • New Features

    • Introduced data cleaning functionality for specific pollutant columns.
    • Added validation for network names within the analytics utility.
  • Bug Fixes

    • Enhanced error handling for various failure scenarios in data retrieval and validation processes.
  • Documentation

    • Updated docstrings to clarify argument types for methods.
  • Refactor

    • Streamlined data retrieval processes and simplified method signatures across multiple classes.
    • Adjusted logic for handling pollutants and data retrieval for improved clarity and efficiency.
  • Chores

    • Updated mappings for pollutants in the analytics utilities.

coderabbitai bot commented Dec 10, 2024

📝 Walkthrough

The pull request introduces several modifications across multiple files, primarily enhancing the EventsModel class in src/analytics/api/models/events.py. Key changes include the addition of a filter_columns parameter in the download_from_bigquery method, the introduction of a new simple_data_cleaning method, and updates to SQL query handling. Other files, such as data_formatters.py, pm_25.py, dashboard.py, and data.py, also see updates that refine data validation, enhance error handling, and streamline data retrieval processes.

Changes

| File Path | Change Summary |
| --- | --- |
| src/analytics/api/models/events.py | Added `filter_columns` parameter to `download_from_bigquery`; introduced `simple_data_cleaning` method; updated `get_d3_chart_events_v2` to remove the `tenant` parameter. |
| src/analytics/api/utils/data_formatters.py | Added `validate_network` method; updated docstring for `filter_non_private_sites`. |
| src/analytics/api/utils/pollutants/pm_25.py | Updated `COMMON_POLLUTANT_MAPPING`; removed the `raw` entry from `BIGQUERY_FREQUENCY_MAPPER`. |
| src/analytics/api/views/dashboard.py | Updated `D3ChartDataResource` and `ChartDataResource` post methods to streamline data retrieval. |
| src/analytics/api/views/data.py | Enhanced `DataExportResource` and `DataSummaryResource` post methods with additional validation and error handling. |

Possibly related PRs

Suggested labels

hotfix, ready for review

Suggested reviewers

  • Baalmart
  • BenjaminSsempala
  • OchiengPaul442



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 734dd17 and 84577f2.

📒 Files selected for processing (1)
  • src/analytics/api/utils/data_formatters.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/analytics/api/utils/data_formatters.py


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (5)
src/analytics/api/models/events.py (2)

551-590: Simplify and optimize the simple_data_cleaning method

There appears to be redundancy in the data cleaning logic, specifically in handling NaN values for pm2_5.

Refactor the method to eliminate redundant operations:

 data["pm2_5_raw_value"] = pd.to_numeric(
     data["pm2_5_raw_value"], errors="coerce"
 )
 data["pm2_5"] = pd.to_numeric(data["pm2_5"], errors="coerce")

- data.loc[~data["pm2_5_raw_value"].isna(), "pm2_5"] = np.nan

 if data["pm2_5_raw_value"].isna().all():
     data.drop(columns=["pm2_5_raw_value"], inplace=True)
+else:
+    data.loc[~data["pm2_5_raw_value"].isna(), "pm2_5"] = np.nan

 data.dropna(how="all", axis=1, inplace=True)

This change ensures that pm2_5 values are set to NaN only when pm2_5_raw_value has data, avoiding unnecessary reassignment.
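Applied in full, the refactor described above could read as follows — a minimal sketch written as a free function (in the PR this is a method on EventsModel; column names follow the review excerpt):

```python
import numpy as np
import pandas as pd


def simple_data_cleaning(data: pd.DataFrame) -> pd.DataFrame:
    """Prefer raw pm2_5 readings where present; drop the raw column when empty."""
    data["pm2_5_raw_value"] = pd.to_numeric(
        data["pm2_5_raw_value"], errors="coerce"
    )
    data["pm2_5"] = pd.to_numeric(data["pm2_5"], errors="coerce")

    if data["pm2_5_raw_value"].isna().all():
        # No raw readings at all: the column adds nothing, so drop it.
        data.drop(columns=["pm2_5_raw_value"], inplace=True)
    else:
        # Blank out calibrated values only where a raw reading exists.
        data.loc[~data["pm2_5_raw_value"].isna(), "pm2_5"] = np.nan

    # Finally remove any columns that are entirely NaN.
    data.dropna(how="all", axis=1, inplace=True)
    return data
```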


551-590: Add unit tests for simple_data_cleaning

Given the importance of the new simple_data_cleaning method, it's essential to have unit tests covering various scenarios to ensure reliability.

Consider adding unit tests that cover:

  • Cases where only pm2_5 has data.
  • Cases where only pm2_5_raw_value has data.
  • Cases where both have data.
  • Edge cases with missing or all NaN values.
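A test module covering those scenarios might look like this sketch — the cleaning function is inlined here as a stand-in so the example is self-contained; a real test would import the method from src/analytics/api/models/events.py:

```python
import numpy as np
import pandas as pd


# Stand-in for EventsModel.simple_data_cleaning (copied for illustration).
def simple_data_cleaning(data):
    data["pm2_5_raw_value"] = pd.to_numeric(data["pm2_5_raw_value"], errors="coerce")
    data["pm2_5"] = pd.to_numeric(data["pm2_5"], errors="coerce")
    if data["pm2_5_raw_value"].isna().all():
        data.drop(columns=["pm2_5_raw_value"], inplace=True)
    else:
        data.loc[~data["pm2_5_raw_value"].isna(), "pm2_5"] = np.nan
    data.dropna(how="all", axis=1, inplace=True)
    return data


def test_only_pm2_5_has_data():
    out = simple_data_cleaning(
        pd.DataFrame({"pm2_5": [1.0, 2.0], "pm2_5_raw_value": [np.nan, np.nan]})
    )
    assert "pm2_5_raw_value" not in out.columns
    assert list(out["pm2_5"]) == [1.0, 2.0]


def test_both_have_data_prefers_raw():
    out = simple_data_cleaning(
        pd.DataFrame({"pm2_5": [1.0, 2.0], "pm2_5_raw_value": [9.0, np.nan]})
    )
    # Calibrated value is blanked only where a raw reading exists.
    assert np.isnan(out.loc[0, "pm2_5"]) and out.loc[1, "pm2_5"] == 2.0


def test_all_nan_columns_are_dropped():
    out = simple_data_cleaning(
        pd.DataFrame({"pm2_5": [np.nan], "pm2_5_raw_value": [np.nan]})
    )
    assert out.columns.empty
```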
src/analytics/api/utils/data_formatters.py (1)

331-354: Handle exceptions when fetching networks

In validate_network, the API call to /users/networks may fail. It's important to handle exceptions and log them appropriately.

Add exception handling around the API request:

try:
    response = airqo_requests.request(endpoint=endpoint, method="get")
except Exception as e:
    logger.exception(f"Error fetching networks: {e}")
    return False
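Put together, the hardened helper might look like the sketch below. The `client` argument stands in for `AirQoRequests` so the network call can be faulted in tests; names otherwise follow the review excerpt:

```python
import logging

logger = logging.getLogger(__name__)


def validate_network(network_name: str, client) -> bool:
    """Return True when network_name is a known network.

    `client` is any object exposing request(endpoint=..., method=...),
    standing in for AirQoRequests.
    """
    if not network_name:
        return False
    try:
        response = client.request(endpoint="/users/networks", method="get")
    except Exception as e:
        # Log the failure instead of letting it propagate to the caller.
        logger.exception(f"Error fetching networks: {e}")
        return False
    if response and "networks" in response:
        networks = response["networks"]
        return any(n.get("net_name") == network_name for n in networks)
    return False
```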
src/analytics/api/views/data.py (2)

Line range hint 590-599: Use appropriate HTTP status code for "No data found"

Returning HTTP 200 OK with success=False for "No data found" scenario is inconsistent with REST principles. Consider using HTTP 404 Not Found to properly indicate the absence of data.

-                    AirQoRequests.Status.HTTP_200_OK,
+                    AirQoRequests.Status.HTTP_404_NOT_FOUND,

Line range hint 558-564: Enhance parameter validation for grid

While validation has been added for new parameters, the grid parameter lacks type validation. Consider adding specific validation rules to ensure the grid parameter meets the expected format.

Consider updating the request validation decorator:

     @validate_request_json(
         "startDateTime|required:datetime",
         "endDateTime|required:datetime",
         "airqloud|optional:str",
         "cohort|optional:str",
-        "grid|optional:str",
+        "grid|optional:str|pattern=^[A-Z0-9]+$",  # Adjust pattern based on your grid format
     )
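If the validation layer does not support a `pattern` rule, the same check can be expressed directly. The pattern below is a placeholder from the suggestion, not the confirmed grid-ID format:

```python
import re

# Placeholder pattern; adjust to the real grid-ID format.
GRID_PATTERN = re.compile(r"^[A-Z0-9]+$")


def is_valid_grid(value: str) -> bool:
    """Return True when value matches the assumed grid-ID format."""
    return bool(GRID_PATTERN.fullmatch(value))
```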
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9cab881 and 734dd17.

📒 Files selected for processing (5)
  • src/analytics/api/models/events.py (6 hunks)
  • src/analytics/api/utils/data_formatters.py (1 hunks)
  • src/analytics/api/utils/pollutants/pm_25.py (1 hunks)
  • src/analytics/api/views/dashboard.py (1 hunks)
  • src/analytics/api/views/data.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
src/analytics/api/models/events.py (1)
Learnt from: NicholasTurner23
PR: airqo-platform/AirQo-api#3819
File: src/analytics/api/models/events.py:47-48
Timestamp: 2024-11-12T09:15:21.461Z
Learning: Certain modules have a different version of `download_from_bigquery`, and updates to align their parameters will happen at a later time.
🔇 Additional comments (5)
src/analytics/api/models/events.py (1)

542-544: Ensure data cleaning is correctly applied for raw data

The addition of cls.simple_data_cleaning(dataframe) for raw data processing is appropriate. Verify that this cleaning step is necessary only for raw data and that it doesn't inadvertently affect other data types.

Run the following script to confirm that data cleaning is only applied when data_type is 'raw':

✅ Verification successful

Let me analyze the implementation of the data cleaning method and other data type handling to ensure there's no unintended impact.


Let me check the implementation details of the data cleaning method and any related cleaning operations to ensure it's properly scoped.


Data cleaning is correctly scoped to raw data type

Based on the codebase analysis, the simple_data_cleaning method is exclusively called within the if data_type == "raw": condition, and there are no other instances where this cleaning is applied to other data types. The implementation follows a clear and intentional pattern where raw data requires specific cleaning operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if 'simple_data_cleaning' is called only for raw data.

rg --no-heading --line-number "if data_type == \"raw\":" src/analytics/api/models/events.py -A 2

Length of output: 383


Script:

#!/bin/bash
# Check for any other data cleaning methods or data type handling
rg --no-heading --line-number "data_type" src/analytics/api/models/events.py -B 2 -A 2

# Look for simple_data_cleaning implementation
ast-grep --pattern 'def simple_data_cleaning($$$)'

Length of output: 1957


Script:

#!/bin/bash
# Search for the implementation of simple_data_cleaning method
rg --no-heading --line-number "simple_data_cleaning" src/analytics/api/models/events.py -B 3 -A 5

# Check for any other data cleaning methods in the file
rg --no-heading --line-number "clean" src/analytics/api/models/events.py

Length of output: 1186

src/analytics/api/utils/pollutants/pm_25.py (1)

Line range hint 68-72: Confirm removal of the 'raw' key in BIGQUERY_FREQUENCY_MAPPER

The 'raw' frequency key has been removed from BIGQUERY_FREQUENCY_MAPPER. Verify that this does not affect any functions that rely on raw frequency mappings.

Check if any code references the 'raw' key in BIGQUERY_FREQUENCY_MAPPER:

src/analytics/api/views/dashboard.py (2)

227-230: Ensure tenant consistency in data retrieval

In the D3ChartDataResource class, the tenant is hardcoded as "airqo". Verify that this is intentional and won't affect multi-tenant support.

If the application supports multiple tenants, consider retrieving the tenant dynamically from the request or context.


227-230: Update method call after removing tenant parameter

The call to get_d3_chart_events_v2 no longer includes the tenant parameter, aligning with the updated method signature.

src/analytics/api/views/data.py (1)

169-169: Reconsider dropping the timestamp column

Dropping the timestamp column when minimum_output is True could be problematic as temporal information is crucial for time-series air quality data. This might affect:

  1. Data traceability and analysis
  2. Client applications expecting timestamp information
  3. Time-based visualizations and charts

Let's verify if any client code depends on the timestamp field:

✅ Verification successful

The previous search revealed frontend usage of timestamps, but we need to verify the analytics service's API contract and dependencies. Let's check the Python codebase and any API documentation within the analytics service to understand if removing the timestamp field could break existing functionality.


Dropping timestamp is safe with minimum_output flag

Based on the codebase analysis, dropping the timestamp column when minimum_output is True is actually a valid optimization:

  1. The timestamp field is only dropped when explicitly requested via minimum_output=True, indicating a conscious choice by the API consumer to receive minimal data
  2. The codebase shows that timestamp is primarily used for:
    • Internal data processing (sorting, grouping)
    • Converting to local timezones
    • Generating time-based visualizations
    • These operations happen before the final output is returned

The change aligns with the API's purpose of providing a minimal data option while preserving the timestamp for all other use cases.
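The behaviour described above reduces to a small conditional in the response-shaping step. A sketch follows; `minimum_output` and the `timestamp` column name come from the review, while the function name is illustrative:

```python
import pandas as pd


def shape_output(df: pd.DataFrame, minimum_output: bool) -> pd.DataFrame:
    """Drop the timestamp column only when the caller opts into minimal data."""
    if minimum_output and "timestamp" in df.columns:
        # drop() returns a copy, so the caller's DataFrame is untouched.
        df = df.drop(columns=["timestamp"])
    return df
```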

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for timestamp usage in frontend code
rg -g '*.{js,jsx,ts,tsx}' 'timestamp' 

# Search for timestamp in API documentation
rg -g '*.yml' 'timestamp' api/docs/

Length of output: 16186


Script:

#!/bin/bash
# Search for timestamp usage in Python files related to analytics/data
rg -g '*.py' -B 3 -A 3 'timestamp' src/analytics/

# Search for API documentation or schema definitions
fd -e yaml -e yml . src/analytics/ -x cat {}

Length of output: 55464

src/analytics/api/models/events.py
-    def get_d3_chart_events_v2(
-        self, sites, start_date, end_date, pollutant, frequency, tenant
-    ):
+    def get_d3_chart_events_v2(self, sites, start_date, end_date, pollutant, frequency):

💡 Codebase verification

Update required: Inconsistent method signatures across codebase

The verification reveals that there are inconsistent implementations and usages of get_d3_chart_events_v2:

  • src/insights/models/events.py:1010: Method defined with tenant parameter
  • src/insights/namespaces/dashboard.py:147: Called with tenant parameter
  • src/analytics/api/models/events.py:1322: Method defined without tenant parameter
  • src/analytics/api/views/dashboard.py:229: Called without tenant parameter

This indicates a potential architectural issue where two different implementations of the same method exist with different signatures, which could lead to runtime errors.

🔗 Analysis chain

Verify all usages of get_d3_chart_events_v2 are updated

The method get_d3_chart_events_v2 no longer includes the tenant parameter. Ensure that all calls to this method have been updated to prevent potential TypeError exceptions.

Run the following script to locate all usages:

Review each occurrence to confirm that the tenant parameter is no longer being passed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all calls to 'get_d3_chart_events_v2' and check parameters.

rg --no-heading --line-number 'get_d3_chart_events_v2\(' src/ -A 1

Length of output: 888

@@ -413,6 +413,7 @@
     frequency (str): Data frequency (e.g., 'raw', 'daily', 'hourly').
     pollutants (list): List of pollutants to include in the data.
     data_type (str): Type of data ('raw' or 'aggregated').
+    filter_columns(list)

⚠️ Potential issue

Fix the docstring formatting for filter_columns parameter

The new parameter filter_columns(list) in the docstring is not properly formatted. It should include a colon and a descriptive explanation to match the existing docstring style.

Apply this diff to correct the docstring:

 filter_value (list): Filter values corresponding to the filter type.
 start_date (str): Start date for the data query.
 end_date (str): End date for the data query.
 frequency (str): Data frequency (e.g., 'raw', 'daily', 'hourly').
 pollutants (list): List of pollutants to include in the data.
 data_type (str): Type of data ('raw' or 'aggregated').
+filter_columns (list): List of columns to filter.
 weather_fields (list): List of weather fields to retrieve.

Committable suggestion skipped: line range outside the PR's diff.

src/analytics/api/utils/pollutants/pm_25.py
Comment on lines 331 to 354
def validate_network(self, network_name: str) -> bool:
    """
    Validate if a given network name exists in the list of networks.

    Args:
        network_name (str): The name of the network to validate.

    Returns:
        bool: True if the network name exists, False otherwise.
    """
    if not network_name:
        return False

    endpoint: str = "/users/networks"
    airqo_requests = AirQoRequests()
    response = airqo_requests.request(endpoint=endpoint, method="get")

    if response and "networks" in response:
        networks = response["networks"]
        # TODO Could add an active network filter
        return any(network.get("net_name") == network_name for network in networks)

    return False


⚠️ Potential issue

Remove unnecessary self parameter from validate_network function

The validate_network function is defined with a self parameter but is not inside a class, causing a TypeError when called.

Modify the function definition to remove the self parameter:

-def validate_network(self, network_name: str) -> bool:
+def validate_network(network_name: str) -> bool:

Also, ensure that all calls to this function are updated accordingly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def validate_network(self, network_name: str) -> bool:
+def validate_network(network_name: str) -> bool:
     """
     Validate if a given network name exists in the list of networks.

     Args:
         network_name (str): The name of the network to validate.

     Returns:
         bool: True if the network name exists, False otherwise.
     """
     if not network_name:
         return False

     endpoint: str = "/users/networks"
     airqo_requests = AirQoRequests()
     response = airqo_requests.request(endpoint=endpoint, method="get")

     if response and "networks" in response:
         networks = response["networks"]
         # TODO Could add an active network filter
         return any(network.get("net_name") == network_name for network in networks)

     return False

@Baalmart Baalmart merged commit dd764c2 into airqo-platform:staging Dec 10, 2024
45 of 46 checks passed
@Baalmart Baalmart mentioned this pull request Dec 10, 2024
1 task
This was referenced Dec 12, 2024