[COST-4389] Masu endpoint to convert parquet data types #4837

Merged
myersCody merged 38 commits into main from COST-4389-fix-parquet-masu
Jan 15, 2024

Conversation

@myersCody myersCody commented Dec 14, 2023

Jira Ticket

COST-4389

Description

This change adds an internal endpoint that downloads a parquet file from S3, reads the parquet schema from the local file, checks the data types, casts or transforms the data to the correct type, and uploads the parquet file back to S3.
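
The type check in the middle of that flow can be sketched as follows. This is a minimal illustration, not the code added by this PR: it models a parquet schema as a plain mapping of column name to type name, and `find_mismatched_columns`, the dict-based schemas, and the `lineitem_usageamount` column are all hypothetical.

```python
def find_mismatched_columns(current, expected):
    """Return {column: (current_type, expected_type)} for columns whose type differs."""
    mismatches = {}
    for column, expected_type in expected.items():
        current_type = current.get(column)
        # Columns absent from the file are skipped; only type conflicts are reported.
        if current_type is not None and current_type != expected_type:
            mismatches[column] = (current_type, expected_type)
    return mismatches


# Hypothetical schemas, written as simple type-name strings.
current_schema = {"bill_billingentity": "double", "lineitem_usageamount": "double"}
expected_schema = {"bill_billingentity": "string", "lineitem_usageamount": "double"}

mismatched = find_mismatched_columns(current_schema, expected_schema)
print(mismatched)
```

Only the columns returned here would need to be cast before the file is written back; a file with no mismatches could be left untouched.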

Testing

  1. Check out the branch
  2. Restart Koku
  3. Create parquet files that reproduce the reindex problem that we already solved.
    • I created this script to create and upload the parquet files to MinIO.

I have been running it by updating the make shell-schema command:

# shell-schema:
# 	$(DJANGO_MANAGE) tenant_command shell --schema=$(schema) < create_reindex_parquet_files.py
  4. Now trigger the internal endpoint:
http://127.0.0.1:5042/api/cost-management/v1/fix_parquet/?schema=org1234567&start_date=2023-12-01&provider_type=AWS-local
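
For scripted testing, the same URL can be assembled from its three query parameters. This is only a convenience sketch: `build_fix_parquet_url` is a hypothetical helper, and the host and port are the local development values shown above.

```python
from urllib.parse import urlencode

# Local development endpoint taken from the testing step above.
BASE_URL = "http://127.0.0.1:5042/api/cost-management/v1/fix_parquet/"


def build_fix_parquet_url(schema, start_date, provider_type):
    """Assemble the endpoint URL with its query string (hypothetical helper)."""
    query = urlencode(
        {"schema": schema, "start_date": start_date, "provider_type": provider_type}
    )
    return f"{BASE_URL}?{query}"


url = build_fix_parquet_url("org1234567", "2023-12-01", "AWS-local")
print(url)
```

The resulting string can be opened in a browser or passed to any HTTP client.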
  5. Check the logs:
koku-koku-worker-1  | [2024-01-09 16:52:58,688] INFO 93e34c4a-7438-4a25-920f-b696299970eb 44 {'message': 'Downloading file locally', 'tracing_id': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'provider_type': 'AWS', 'provider_uuid': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'schema': 'org1234567', 'simulate': False, 'bill_date': datetime.date(2024, 1, 1), 's3_prefix': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01', 's3_object_key': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01/test_b5707821-73bd-471f-ba32-9865eb9deba0.parquet'}
koku-koku-worker-1  | [2024-01-09 16:52:58,699] INFO 93e34c4a-7438-4a25-920f-b696299970eb 44 {'message': 'Checking local parquet_file', 'tracing_id': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'provider_type': 'AWS', 'provider_uuid': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'schema': 'org1234567', 'simulate': False, 'bill_date': datetime.date(2024, 1, 1), 's3_prefix': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01', 's3_object_key': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01/test_b5707821-73bd-471f-ba32-9865eb9deba0.parquet', 'local_file_path': '/testing/data/processing/org1234567/b5707821-73bd-471f-ba32-9865eb9deba0/test_b5707821-73bd-471f-ba32-9865eb9deba0.parquet'}
koku-koku-worker-1  | [2024-01-09 16:52:58,705] INFO 93e34c4a-7438-4a25-920f-b696299970eb 44 {'message': 'Incorrect data type, building new schema.', 'tracing_id': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'provider_type': 'AWS', 'provider_uuid': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'schema': 'org1234567', 'simulate': False, 'bill_date': datetime.date(2024, 1, 1), 's3_prefix': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01', 's3_object_key': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01/test_b5707821-73bd-471f-ba32-9865eb9deba0.parquet', 'column_name': 'bill_billingentity', 'current_dtype': DataType(double), 'expected_data_type': DataType(string)}
koku-koku-worker-1  | [2024-01-09 16:52:58,706] INFO 93e34c4a-7438-4a25-920f-b696299970eb 44 {'message': 'Incorrect data type, building new schema.', 'tracing_id': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'provider_type': 'AWS', 'provider_uuid': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'schema': 'org1234567', 'simulate': False, 'bill_date': datetime.date(2024, 1, 1), 's3_prefix': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01', 's3_object_key': 'data/parquet/org1234567/AWS/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01/test_b5707821-73bd-471f-ba32-9865eb9deba0.parquet', 'column_name': 'lineitem_usagestartdate', 'current_dtype': DataType(double), 'expected_data_type': TimestampType(timestamp[ms, tz=UTC])}
koku-koku-worker-1  | [2024-01-09 16:52:58,762] INFO 93e34c4a-7438-4a25-920f-b696299970eb 44 {'message': 'Uploading revised parquet file.', 'tracing_id': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'provider_type': 'AWS', 'provider_uuid': 'b5707821-73bd-471f-ba32-9865eb9deba0', 'schema': 'org1234567', 'simulate': False, 'bill_date': datetime.date(2024, 1, 1), 's3_object_key': 'data/parquet/daily/org1234567/AWS/raw/source=b5707821-73bd-471f-ba32-9865eb9deba0/year=2024/month=01/test_b5707821-73bd-471f-ba32-9865eb9deba0.parquet', 'local_file_path': '/testing/data/processing/org1234567/b5707821-73bd-471f-ba32-9865eb9deba0/test_b5707821-73bd-471f-ba32-9865eb9deba0.parquet'}
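
When there are many such lines, the flagged columns can be pulled out with a small script. The sketch below assumes the log format shown above, where the mismatch fields are the last three entries of the logged dict; `extract_mismatches` is a hypothetical helper, and the sample lines are abbreviated copies of the worker output with the middle fields removed.

```python
import re

# Matches the trailing column_name / current_dtype / expected_data_type fields.
MISMATCH_PATTERN = re.compile(
    r"'column_name': '(?P<column>[^']+)', "
    r"'current_dtype': (?P<current>.+?), "
    r"'expected_data_type': (?P<expected>.+)\}\s*$"
)


def extract_mismatches(log_lines):
    """Return (column, current_dtype, expected_dtype) for each mismatch line."""
    found = []
    for line in log_lines:
        match = MISMATCH_PATTERN.search(line)
        if match:
            found.append((match["column"], match["current"], match["expected"]))
    return found


# Abbreviated log lines (tracing, provider, and S3 fields removed for brevity).
sample_lines = [
    "{'message': 'Downloading file locally', 'schema': 'org1234567'}",
    "{'message': 'Incorrect data type, building new schema.', "
    "'column_name': 'bill_billingentity', 'current_dtype': DataType(double), "
    "'expected_data_type': DataType(string)}",
    "{'message': 'Incorrect data type, building new schema.', "
    "'column_name': 'lineitem_usagestartdate', 'current_dtype': DataType(double), "
    "'expected_data_type': TimestampType(timestamp[ms, tz=UTC])}",
]

flagged = extract_mismatches(sample_lines)
for column, current, expected in flagged:
    print(f"{column}: {current} -> {expected}")
```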
  6. Next, check that the provider has been marked as successful in the database.
postgres=# select additional_context from api_provider where type='AWS-local';
                                                        additional_context                                                         
-----------------------------------------------------------------------------------------------------------------------------------
 {"conversion_metadata": {"2023-12-01": {"version": "0", "successful": true}, "2024-01-01": {"version": "0", "successful": true}}}
(1 row)
  7. Now try to retrigger the same endpoint:
http://127.0.0.1:5042/api/cost-management/v1/fix_parquet/?schema=org1234567&start_date=2023-12-01&provider_type=AWS-local

Note that the conversion was already marked as successful in the logs and no tasks were queued.

masu_server         | [2024-01-09 17:00:43,480] INFO None 29 {'message': 'Conversion already marked as successful', 'tracing_id': None, 'bill_date': '2023-12-01', 'provider_uuid': None}
masu_server         | [2024-01-09 17:00:43,480] INFO None 29 {'message': 'Conversion already marked as successful', 'tracing_id': None, 'bill_date': '2024-01-01', 'provider_uuid': None}
  8. Update one of the months to have an unsuccessful result:
UPDATE api_provider
SET additional_context = jsonb_set(
    additional_context,
    '{conversion_metadata, "2023-12-01", successful}',
    'false'::jsonb
)
WHERE type = 'AWS-local';
  9. Try the endpoint again and see that only one task was kicked off:
{
    "Async jobs for fix parquet files": "['53169a2e-3d7f-48cd-9a73-3f7815272c2b']"
}
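
The skip behavior exercised in the last few steps can be sketched as follows. This is an assumption about the decision logic based only on the `conversion_metadata` structure shown in the database output, not the PR's actual implementation; `bill_dates_needing_conversion` is a hypothetical name.

```python
import json


def bill_dates_needing_conversion(additional_context, bill_dates):
    """Return the bill dates that are not yet marked successful (hypothetical helper)."""
    metadata = additional_context.get("conversion_metadata", {})
    return [
        bill_date
        for bill_date in bill_dates
        # A missing entry is treated the same as an unsuccessful one.
        if not metadata.get(bill_date, {}).get("successful", False)
    ]


# additional_context as returned by the query in the testing steps.
context = json.loads(
    '{"conversion_metadata": {"2023-12-01": {"version": "0", "successful": true}, '
    '"2024-01-01": {"version": "0", "successful": true}}}'
)
bill_dates = ["2023-12-01", "2024-01-01"]

# Both months are marked successful, so nothing would be queued.
all_done = bill_dates_needing_conversion(context, bill_dates)

# Mirror the UPDATE statement: flip one month back to unsuccessful.
context["conversion_metadata"]["2023-12-01"]["successful"] = False
pending = bill_dates_needing_conversion(context, bill_dates)
print(all_done, pending)
```

Under this reading, one task is queued per pending bill date, which matches the single task ID returned after the update.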

Notes

...

@myersCody myersCody changed the title from "Cost 4389 fix parquet masu" to "[COST-4389] Masu endpoint to convert parquet data types" Dec 14, 2023

codecov bot commented Dec 14, 2023

Codecov Report

Merging #4837 (bd374dc) into main (85e6f9d) will increase coverage by 0.0%.
The diff coverage is 96.2%.

Additional details and impacted files
@@          Coverage Diff           @@
##            main   #4837    +/-   ##
======================================
  Coverage   94.0%   94.0%            
======================================
  Files        365     371     +6     
  Lines      30310   30680   +370     
  Branches    3607    3661    +54     
======================================
+ Hits       28493   28848   +355     
- Misses      1158    1167     +9     
- Partials     659     665     +6     
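
As a quick sanity check on the table, the headline percentages can be recomputed from the Hits, Misses, and Partials rows. The sketch assumes coverage is hits divided by the sum of hits, misses, and partials, which is consistent with the Lines row above; `coverage_percent` is a hypothetical helper.

```python
def coverage_percent(hits, misses, partials):
    """Coverage as a percentage, assuming total = hits + misses + partials."""
    total = hits + misses + partials
    return 100 * hits / total


# Figures copied from the report: main first, then this PR.
main = coverage_percent(28493, 1158, 659)
branch = coverage_percent(28848, 1167, 665)
print(f"{main:.1f}% -> {branch:.1f}% (change {branch - main:+.2f} points)")
```

The change is positive but far smaller than a tenth of a point, which is why the report rounds it to 0.0%.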

@lcouzens lcouzens added the smoke-tests pr_check will build the image and run minimal required smokes label Jan 4, 2024
lcouzens commented Jan 4, 2024

/retest

koku/masu/api/upgrade_trino/test/test_view.py (outdated, resolved)
koku/masu/api/upgrade_trino/test/test_view.py (outdated, resolved)
koku/masu/api/upgrade_trino/util/state_tracker.py (outdated, resolved)
koku/masu/api/upgrade_trino/util/verify_parquet_files.py (outdated, resolved)
koku/masu/api/upgrade_trino/util/verify_parquet_files.py (outdated, resolved)
koku/masu/api/upgrade_trino/util/verify_parquet_files.py (outdated, resolved)
koku/masu/api/upgrade_trino/view.py (outdated, resolved)
@lcouzens lcouzens removed the smoke-tests pr_check will build the image and run minimal required smokes label Jan 8, 2024
@myersCody myersCody marked this pull request as ready for review January 9, 2024 13:57
@myersCody myersCody requested review from a team as code owners January 9, 2024 13:57
@myersCody myersCody added the smoke-tests pr_check will build the image and run minimal required smokes label Jan 9, 2024
@myersCody

/retest

lcouzens commented Jan 9, 2024

/retest

sonarcloud bot commented Jan 15, 2024

@myersCody myersCody merged commit ecfbf91 into main Jan 15, 2024
10 checks passed
@myersCody myersCody deleted the COST-4389-fix-parquet-masu branch January 15, 2024 15:53