
Update fix/clean up #4050

Merged
merged 2 commits into airqo-platform:staging
Dec 12, 2024

Conversation

Contributor

@NicholasTurner23 NicholasTurner23 commented Dec 12, 2024

Description

This PR corrects the updates to the devices and sites tables in BigQuery. The data in these tables is used by the analytics application when performing joins to gain extra details.
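As a rough illustration of the kind of join described above (the frames and column values here are hypothetical, with column names taken from the schema changes in this PR):

```python
import pandas as pd

# Hypothetical minimal frames mirroring the BigQuery devices and sites tables.
devices = pd.DataFrame(
    {"device_id": ["d1", "d2"], "site_id": ["s1", "s2"], "network": ["airqo", "airqo"]}
)
sites = pd.DataFrame(
    {"site_id": ["s1", "s2"], "name": ["Kampala", "Jinja"], "country": ["UG", "UG"]}
)

# Joining on site_id enriches each device row with extra site details.
enriched = devices.merge(sites, on="site_id", how="left")
```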

Related Issues

  • JIRA cards:
    • OPS-324

Summary by CodeRabbit

  • New Features
  • Introduced new fields: last_updated in the device schema and network in the site schema, to enhance data tracking.
  • Bug Fixes
    • Improved error handling for invalid date inputs in data fetching methods.
  • Documentation
    • Added detailed docstrings to methods for better clarity.
  • Refactor
    • Updated method parameters and DataFrame structures for improved consistency and clarity.


coderabbitai bot commented Dec 12, 2024

📝 Walkthrough


The changes in this pull request enhance the BigQueryApi and MetaDataUtils classes by updating method signatures, improving error handling, and modifying DataFrame structures. Key alterations include changing the parameter from tenant to network, adding new fields to JSON schemas for devices and sites, and refining logging practices. The updates aim to improve clarity and robustness in data handling processes while maintaining the core functionalities of the methods involved.

Changes

File Path Change Summary
src/workflows/airqo_etl_utils/bigquery_api.py - Added docstring to validate_data method.
- Updated add_unique_id and device_unique_col methods to change parameter from tenant to network.
- Enhanced error handling in fetch_device_data_for_forecast_job and fetch_device_data_for_satellite_job methods.
src/workflows/airqo_etl_utils/meta_data_utils.py - Removed columns "tenant", "approximate_latitude", and "approximate_longitude" from extract_devices_from_api method.
- Replaced "tenant" with "network" in extract_sites_from_api method.
- Renamed columns in extract_sites_from_api method.
src/workflows/airqo_etl_utils/schema/devices.json - Added new field last_updated of type TIMESTAMP, mode NULLABLE.
src/workflows/airqo_etl_utils/schema/sites.json - Added required field network of type STRING and nullable field last_updated of type DATE.


Suggested reviewers

  • Baalmart
  • BenjaminSsempala
  • OchiengPaul442

🌟 In the realm of code, changes unfold,
With network in hand, new stories are told.
From tenant to ID, the data takes flight,
Enhancing our tools, making everything right.
With schemas that grow and methods that shine,
Here’s to our progress, one line at a time! 🎉




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
src/workflows/airqo_etl_utils/schema/devices.json (1)

66-70: Ensure timestamp timezone handling is documented

The addition of 'last_updated' as TIMESTAMP is good for tracking changes. However, we should ensure that the timezone handling is consistent across the application.

Consider adding a comment in the schema to specify the expected timezone format (UTC recommended).
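One way to keep timezone handling consistent is to normalize last_updated values to UTC in the ETL layer before loading to BigQuery. A minimal sketch (the column name comes from this PR; everything else is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"last_updated": ["2024-12-12 10:30:00", "2024-12-12 11:00:00"]})

# Parse as naive datetimes, then localize to UTC so the TIMESTAMP values
# loaded to BigQuery are unambiguous regardless of where the job runs.
df["last_updated"] = pd.to_datetime(df["last_updated"]).dt.tz_localize("UTC")
```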

src/workflows/airqo_etl_utils/meta_data_utils.py (3)

Line range hint 11-29: Update method signature to use network instead of tenant

The method still accepts a tenant parameter while using network internally. Update it for consistency:

-    def extract_devices_from_api(tenant: Tenant = Tenant.ALL) -> pd.DataFrame:
+    def extract_devices_from_api(network: str = "all") -> pd.DataFrame:

Also, consider adding the 'last_updated' field to the DataFrame to match the schema:

         dataframe = dataframe[
             [
                 "network",
                 "latitude",
                 "longitude",
                 "site_id",
                 "device_id",
                 "device_number",
                 "name",
                 "description",
                 "device_manufacturer",
                 "device_category",
+                "last_updated",
             ]
         ]

Line range hint 123-143: Add last_updated field to sites extraction

The sites schema includes 'last_updated' but it's not being extracted in the DataFrame columns.

         dataframe = dataframe[
             [
                 "network",
                 "site_id",
                 "latitude",
                 "longitude",
                 "approximate_latitude",
                 "approximate_longitude",
                 "name",
                 "location",
                 "search_name",
                 "location_name",
                 "description",
                 "city",
                 "region",
                 "country",
+                "last_updated",
             ]
         ]

Update tenant parameters to network in metadata methods

The review comment is correct. The search results confirm inconsistency in parameter usage, where some methods have been updated to use network while others still use tenant. The following methods in meta_data_utils.py need to be updated:

  • extract_airqlouds_from_api(tenant: Tenant)
  • extract_grids_from_api(tenant: Tenant)
  • extract_cohorts_from_api(tenant: Tenant)
  • refresh_airqlouds(tenant: Tenant)
  • refresh_grids(tenant: Tenant)

These methods should be updated to use network: str parameter for consistency with other refactored methods like extract_sites_from_api(network: str) and update_sites_distance_measures(network: str).

🔗 Analysis chain

Line range hint 186-234: Update remaining tenant references

Several methods still use the tenant parameter while others have been updated to use network. This inconsistency should be addressed.

Methods to update:

  • extract_airqlouds_from_api
  • extract_grids_from_api
  • extract_cohorts_from_api
  • refresh_airqlouds
  • refresh_grids
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for remaining tenant parameters in method signatures
ast-grep --pattern 'def $_(tenant: Tenant)'

Length of output: 45


Script:

#!/bin/bash
# Search for methods with tenant parameter in a more flexible way
rg -U "def \w+\([^)]*tenant[^)]*\)" --type py

# Search for methods that use network parameter
rg -U "def \w+\([^)]*network[^)]*\)" --type py

# Search specifically for the mentioned method names
rg -U "def (extract_airqlouds_from_api|extract_grids_from_api|extract_cohorts_from_api|refresh_airqlouds|refresh_grids)" --type py

Length of output: 13917

src/workflows/airqo_etl_utils/bigquery_api.py (2)

121-143: Well-structured docstring addition!

The comprehensive docstring clearly documents the method's purpose, parameters, return value, and exceptions. Consider enhancing it further by adding an example usage section.

Add an example section like this:

"""
Example:
    >>> df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
    >>> validated_df = validate_data(df, 'my_table')
"""

Line range hint 674-677: Consider consolidating datetime validation.

The start_date_time validation is duplicated in both fetch_device_data_for_forecast_job and fetch_device_data_for_satellite_job. Consider extracting it into a utility method.

def _validate_datetime(date_time: str, param_name: str = "date_time") -> None:
    """Validates if the given string is a valid datetime.
    
    Args:
        date_time: The datetime string to validate
        param_name: Name of the parameter for error message
    
    Raises:
        ValueError: If the datetime string is invalid
    """
    try:
        pd.to_datetime(date_time)
    except ValueError:
        raise ValueError(f"Invalid {param_name}: {date_time}")

Then use it in both methods:

def fetch_device_data_for_forecast_job(self, start_date_time: str, job_type: str):
    self._validate_datetime(start_date_time, "start date time")
    # rest of the method...

def fetch_device_data_for_satellite_job(self, start_date_time: str, job_type: str):
    self._validate_datetime(start_date_time, "start date time")
    # rest of the method...

Also applies to: 726-729

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 83a69df and 93fcf00.

📒 Files selected for processing (4)
  • src/workflows/airqo_etl_utils/bigquery_api.py (3 hunks)
  • src/workflows/airqo_etl_utils/meta_data_utils.py (1 hunks)
  • src/workflows/airqo_etl_utils/schema/devices.json (1 hunks)
  • src/workflows/airqo_etl_utils/schema/sites.json (2 hunks)
🔇 Additional comments (4)
src/workflows/airqo_etl_utils/schema/devices.json (1)

Line range hint 1-11: Consider consolidating 'tenant' and 'network' fields

The schema currently maintains both 'tenant' and 'network' as REQUIRED fields. Based on the changes in meta_data_utils.py where 'tenant' is being replaced with 'network', we should consider deprecating the 'tenant' field to maintain consistency across the codebase.

src/workflows/airqo_etl_utils/schema/sites.json (1)

7-11: Maintain consistency with network/tenant approach

Similar to the devices schema, we have both 'tenant' and 'network' as REQUIRED fields. This should be consolidated as part of the migration from 'tenant' to 'network'.

src/workflows/airqo_etl_utils/bigquery_api.py (2)

150-153: Improved logging implementation!

The switch from print statements to logger.warning enhances error tracking and maintains consistency with Python logging best practices.
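For context, the pattern being endorsed, sketched minimally (the logger name and function are hypothetical, not the actual bigquery_api.py code):

```python
import logging

logger = logging.getLogger("airqo_etl_sketch")  # hypothetical logger name

def warn_on_missing_columns(missing_cols):
    # Unlike bare print statements, logger.warning routes messages through
    # the logging hierarchy, so handlers, levels, and formatting remain
    # configurable by the application.
    if missing_cols:
        logger.warning("Columns missing in dataframe: %s", missing_cols)
```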


270-270: Verify the tenant to network parameter change impact.

The parameter rename from tenant to network in device_unique_col method could affect existing code that uses this method.

Also applies to: 279-280

✅ Verification successful

Let me gather more information to verify if there are any other references to this parameter change.


Parameter rename from tenant to network is consistent

The change from tenant to network in the device_unique_col method is localized and properly implemented. The method is only used once in the codebase for generating unique device identifiers, and the parameter change is consistently applied at both the definition and call site. Other tenant-related functionality in the codebase operates independently of this specific method.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to 'tenant' in unique ID generation
# and verify the consistency of the network parameter usage

# Search for tenant-based unique ID generation
rg -A 2 "device_unique_col.*tenant" 

# Search for network-based unique ID generation to confirm adoption
rg -A 2 "device_unique_col.*network"

Length of output: 361


Script:

#!/bin/bash
# Search for any other occurrences of tenant in the codebase
rg -A 2 "tenant.*device" 

# Search for any imports or usage of the device_unique_col function
rg -A 2 "from.*bigquery_api.*import.*device_unique_col|device_unique_col\("

Length of output: 18285

Comment on lines +78 to +80
"name": "last_updated",
"type": "DATE",
"mode": "NULLABLE"

⚠️ Potential issue

Align date/time type with devices schema

The 'last_updated' field is defined as DATE in sites but TIMESTAMP in devices. This inconsistency could lead to confusion and potential issues in data analysis.

Consider using TIMESTAMP consistently across both schemas:

-    "type": "DATE",
+    "type": "TIMESTAMP",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    "name": "last_updated",
-    "type": "DATE",
-    "mode": "NULLABLE"
+    "name": "last_updated",
+    "type": "TIMESTAMP",
+    "mode": "NULLABLE"

@Baalmart Baalmart merged commit c43059f into airqo-platform:staging Dec 12, 2024
46 checks passed
@Baalmart Baalmart mentioned this pull request Dec 12, 2024
1 task