Update fix/clean up #4050
Conversation
📝 Walkthrough

This pull request updates the devices and sites table schemas and the related ETL utilities: a `last_updated` field is added for change tracking, and the migration from `tenant` to `network` parameters is continued.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (6)
src/workflows/airqo_etl_utils/schema/devices.json (1)
`66-70`: Ensure timestamp timezone handling is documented

The addition of `last_updated` as TIMESTAMP is good for tracking changes. However, we should ensure that timezone handling is consistent across the application.
Consider adding a comment in the schema to specify the expected timezone format (UTC recommended).
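Since BigQuery schema JSON supports a per-field `description`, one way to document the convention is directly in the field entry; a sketch (the description wording is illustrative, not from the PR):

```json
{
  "name": "last_updated",
  "type": "TIMESTAMP",
  "mode": "NULLABLE",
  "description": "Time the record was last updated, stored in UTC."
}
```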
src/workflows/airqo_etl_utils/meta_data_utils.py (3)
Line range hint `11-29`: Update method signature to use network instead of tenant

The method still accepts a `tenant` parameter while using `network` internally. This should be updated for consistency:

```diff
- def extract_devices_from_api(tenant: Tenant = Tenant.ALL) -> pd.DataFrame:
+ def extract_devices_from_api(network: str = "all") -> pd.DataFrame:
```

Also, consider adding the `last_updated` field to the DataFrame to match the schema:

```diff
  dataframe = dataframe[
      [
          "network",
          "latitude",
          "longitude",
          "site_id",
          "device_id",
          "device_number",
          "name",
          "description",
          "device_manufacturer",
          "device_category",
+         "last_updated",
      ]
  ]
```
Line range hint `123-143`: Add last_updated field to sites extraction

The sites schema includes `last_updated`, but it is not being extracted in the DataFrame columns.

```diff
  dataframe = dataframe[
      [
          "network",
          "site_id",
          "latitude",
          "longitude",
          "approximate_latitude",
          "approximate_longitude",
          "name",
          "location",
          "search_name",
          "location_name",
          "description",
          "city",
          "region",
          "country",
+         "last_updated",
      ]
  ]
```
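If the column is added, the timestamps should also be normalized to the UTC convention recommended for the schema. A minimal sketch, assuming the API returns ISO-8601 strings in a `last_updated` column (the helper name is illustrative):

```python
import pandas as pd

def normalize_last_updated(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Parse 'last_updated' strings and normalize them to UTC.

    Unparseable values are coerced to NaT rather than raising.
    """
    dataframe = dataframe.copy()
    dataframe["last_updated"] = pd.to_datetime(
        dataframe["last_updated"], utc=True, errors="coerce"
    )
    return dataframe
```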
Update tenant parameters to network in metadata methods
The review comment is correct. The search results confirm inconsistency in parameter usage: some methods have been updated to use `network` while others still use `tenant`. The following methods in `meta_data_utils.py` need to be updated:
extract_airqlouds_from_api(tenant: Tenant)
extract_grids_from_api(tenant: Tenant)
extract_cohorts_from_api(tenant: Tenant)
refresh_airqlouds(tenant: Tenant)
refresh_grids(tenant: Tenant)
These methods should be updated to use a `network: str` parameter for consistency with other refactored methods like `extract_sites_from_api(network: str)` and `update_sites_distance_measures(network: str)`.

🔗 Analysis chain
Line range hint `186-234`: Update remaining tenant references

Several methods still use the `tenant` parameter while others have been updated to use `network`. This inconsistency should be addressed. Methods to update:
extract_airqlouds_from_api
extract_grids_from_api
extract_cohorts_from_api
refresh_airqlouds
refresh_grids
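One low-risk way to migrate these signatures without breaking existing callers is a thin compatibility shim. This is a sketch under stated assumptions, not code from the repository: the decorator name, warning text, and placeholder body are all illustrative.

```python
import functools
import warnings

def accept_legacy_tenant(func):
    """Let callers keep passing `tenant=` while the API moves to `network=`."""
    @functools.wraps(func)
    def wrapper(*args, tenant=None, **kwargs):
        if tenant is not None:
            warnings.warn(
                "'tenant' is deprecated; use 'network' instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # Map the legacy argument onto the new one unless network was given.
            kwargs.setdefault("network", str(tenant))
        return func(*args, **kwargs)
    return wrapper

@accept_legacy_tenant
def extract_grids_from_api(network: str = "all"):
    # Placeholder body; the real method would fetch grids for the network.
    return network
```

This lets the five methods above be renamed in one pass while old call sites keep working and emit a deprecation warning.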
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check for remaining tenant parameters in method signatures
ast-grep --pattern 'def $_(tenant: Tenant)'
```

Length of output: 45
Script:
```bash
#!/bin/bash
# Search for methods with tenant parameter in a more flexible way
rg -U "def \w+\([^)]*tenant[^)]*\)" --type py

# Search for methods that use network parameter
rg -U "def \w+\([^)]*network[^)]*\)" --type py

# Search specifically for the mentioned method names
rg -U "def (extract_airqlouds_from_api|extract_grids_from_api|extract_cohorts_from_api|refresh_airqlouds|refresh_grids)" --type py
```

Length of output: 13917
src/workflows/airqo_etl_utils/bigquery_api.py (2)
`121-143`: Well-structured docstring addition!

The comprehensive docstring clearly documents the method's purpose, parameters, return value, and exceptions. Consider enhancing it further by adding an example usage section.
Add an example section like this:
```python
"""
Example:
    >>> df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
    >>> validated_df = validate_data(df, 'my_table')
"""
```
Line range hint `674-677`: Consider consolidating datetime validation.

The start_date_time validation is duplicated in both `fetch_device_data_for_forecast_job` and `fetch_device_data_for_satellite_job`. Consider extracting it into a utility method:

```python
def _validate_datetime(date_time: str, param_name: str = "date_time") -> None:
    """Validates if the given string is a valid datetime.

    Args:
        date_time: The datetime string to validate
        param_name: Name of the parameter for the error message

    Raises:
        ValueError: If the datetime string is invalid
    """
    try:
        pd.to_datetime(date_time)
    except ValueError:
        raise ValueError(f"Invalid {param_name}: {date_time}")
```

Then use it in both methods:
```python
def fetch_device_data_for_forecast_job(self, start_date_time: str, job_type: str):
    self._validate_datetime(start_date_time, "start date time")
    # rest of the method...

def fetch_device_data_for_satellite_job(self, start_date_time: str, job_type: str):
    self._validate_datetime(start_date_time, "start date time")
    # rest of the method...
```

Also applies to: 726-729
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/workflows/airqo_etl_utils/bigquery_api.py (3 hunks)
src/workflows/airqo_etl_utils/meta_data_utils.py (1 hunks)
src/workflows/airqo_etl_utils/schema/devices.json (1 hunks)
src/workflows/airqo_etl_utils/schema/sites.json (2 hunks)
🔇 Additional comments (4)
src/workflows/airqo_etl_utils/schema/devices.json (1)
Line range hint `1-11`: Consider consolidating 'tenant' and 'network' fields

The schema currently maintains both 'tenant' and 'network' as REQUIRED fields. Based on the changes in `meta_data_utils.py`, where 'tenant' is being replaced with 'network', we should consider deprecating the 'tenant' field to maintain consistency across the codebase.
src/workflows/airqo_etl_utils/schema/sites.json (1)
`7-11`: Maintain consistency with network/tenant approach

Similar to the devices schema, both 'tenant' and 'network' are REQUIRED fields here. This should be consolidated as part of the migration from 'tenant' to 'network'.
src/workflows/airqo_etl_utils/bigquery_api.py (2)
`150-153`: Improved logging implementation!
The switch from print statements to logger.warning enhances error tracking and maintains consistency with Python logging best practices.
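The pattern being praised looks roughly like this; the function name and message below are illustrative, not the actual `bigquery_api.py` code:

```python
import logging

logger = logging.getLogger(__name__)

def find_missing_columns(dataframe_columns, required_columns):
    """Report missing columns via the logging framework instead of print()."""
    missing = [col for col in required_columns if col not in dataframe_columns]
    if missing:
        # logger.warning routes through configurable handlers and levels,
        # unlike a bare print(), and records the originating module.
        logger.warning("Dataframe is missing columns: %s", missing)
    return missing
```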
`270-270`: Verify the tenant to network parameter change impact.

The parameter rename from `tenant` to `network` in the `device_unique_col` method could affect existing code that uses this method.
Also applies to: 279-280
✅ Verification successful
Let me gather more information to verify if there are any other references to this parameter change.
Parameter rename from tenant to network is consistent

The change from `tenant` to `network` in the `device_unique_col` method is localized and properly implemented. The method is only used once in the codebase for generating unique device identifiers, and the parameter change is consistently applied at both the definition and the call site. Other tenant-related functionality in the codebase operates independently of this specific method.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to 'tenant' in unique ID generation
# and verify the consistency of the network parameter usage
# Search for tenant-based unique ID generation
rg -A 2 "device_unique_col.*tenant"
# Search for network-based unique ID generation to confirm adoption
rg -A 2 "device_unique_col.*network"
Length of output: 361
Script:
#!/bin/bash
# Search for any other occurrences of tenant in the codebase
rg -A 2 "tenant.*device"
# Search for any imports or usage of the device_unique_col function
rg -A 2 "from.*bigquery_api.*import.*device_unique_col|device_unique_col\("
Length of output: 18285
```json
"name": "last_updated",
"type": "DATE",
"mode": "NULLABLE"
```
Align date/time type with devices schema
The 'last_updated' field is defined as DATE in sites but TIMESTAMP in devices. This inconsistency could lead to confusion and potential issues in data analysis.
Consider using TIMESTAMP consistently across both schemas:
```diff
- "type": "DATE",
+ "type": "TIMESTAMP",
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```json
"name": "last_updated",
"type": "TIMESTAMP",
"mode": "NULLABLE"
```
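To catch this kind of type drift automatically, a small check could compare the declared type of shared fields across the schema files. A sketch: the helper is hypothetical, and the inline dicts stand in for the contents of `devices.json` and `sites.json` from this PR.

```python
import json  # in CI, json.load would read the real schema files

def field_types(schema, field_name):
    """Return the declared type(s) of `field_name` in a BigQuery JSON schema list."""
    return [field["type"] for field in schema if field["name"] == field_name]

# Minimal stand-ins for the two schema files reviewed above:
devices = [{"name": "last_updated", "type": "TIMESTAMP", "mode": "NULLABLE"}]
sites = [{"name": "last_updated", "type": "DATE", "mode": "NULLABLE"}]

# The inconsistency flagged in the review shows up as differing types.
inconsistent = field_types(devices, "last_updated") != field_types(sites, "last_updated")
```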
Description
This PR corrects the updates to the devices and sites tables in BigQuery. The data in these tables is used by the analytics application when making joins to gain some extra details.
Related Issues
Summary by CodeRabbit
Added `last_updated` in the device schema and `network` in the site schema to enhance data tracking.