moving deduplication of log_sensor_file outside the loop #38

Merged
merged 1 commit into from
Oct 31, 2023

@sid4py sid4py commented Oct 30, 2023

The code was passing an instance of the log_sensor_file table to the write_file_into function inside dataflow. On each call, the function queried the table and ran code to deduplicate it. This happens 850 times nightly, and the number will grow as the session folders grow. The table only needs to be queried and deduplicated once; this PR implements that change. FYI, function call times are:
Querying and dedup 10 times: 31.17 seconds
Querying once and dedup 10 times: 6.26 seconds
Querying once and dedup one time: 2.58 seconds
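
For readers skimming the thread, the change described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `process_folders`, `query_table`, `dedup_rows`, and the `(id, file_names)` row shape are hypothetical names chosen for the example, and only `write_file_into` and `log_sensor_file` come from the description.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical row shape: (row id, list of file names from a text[] cell).
Row = Tuple[int, Sequence[str]]


def dedup_rows(rows: Iterable[Row]) -> List[Tuple[int, Tuple[str, ...]]]:
    """Keep the first row seen for each distinct list of file names."""
    seen = set()
    unique = []
    for row_id, names in rows:
        key = tuple(names)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            unique.append((row_id, key))
    return unique


def process_folders(
    folders: Iterable[str],
    query_table: Callable[[], Iterable[Row]],
    write_file_into: Callable[[str, list], None],
) -> None:
    # Query and deduplicate once, before the loop, instead of once per call.
    known_files = dedup_rows(query_table())
    for folder in folders:
        # Each call reuses the already deduplicated rows.
        write_file_into(folder, known_files)
```

With this shape, the cost of the query and the deduplication no longer scales with the number of session folders; only the per-folder work inside the loop does.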

@sid4py sid4py requested a review from lwhite1 October 30, 2023 23:32
    for i in x:
        c = c+str(i)
    return c
# Convert an array of filenames into a single concatenated string. If order of filename array
Contributor

Could this be addressed by sorting here? Assuming the file name list isn't too long it might not be expensive.

Contributor Author

The table is structured so that the file names are a list in a single cell, i.e. the PostgreSQL datatype text[]. The list itself contains very few file names, often just one. Any sort operation on the file names would still require unpacking them first. I chose this route because the apply() and drop_duplicates() methods are efficiently implemented in pandas. I think sorting on another column and then selecting unique lists of file names would also work.
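
A rough sketch of the two options discussed in this thread, assuming the table has been loaded into a pandas DataFrame with a list-valued column. The column names `id` and `file_names`, the sample rows, and the helper `dedup_by_file_names` are made up for illustration. The sketch also swaps the string concatenation from the diff for a tuple key, since joining names without a separator could in principle make `["ab", "c"]` and `["a", "bc"]` look identical.

```python
import pandas as pd

# Hypothetical stand-in for rows read from log_sensor_file.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "file_names": [
            ["a.csv"],
            ["b.csv", "c.csv"],
            ["a.csv"],
            ["c.csv", "b.csv"],
        ],
    }
)


def dedup_by_file_names(frame: pd.DataFrame, sort_names: bool = False) -> pd.DataFrame:
    """Drop rows whose file-name list repeats an earlier row.

    With sort_names=True the comparison ignores the order of names
    inside each list, which is the reviewer's suggestion.
    """

    def to_key(names):
        ordered = sorted(names) if sort_names else list(names)
        # Lists are unhashable, so drop_duplicates cannot compare them
        # directly; a tuple of strings is a hashable equivalent.
        return tuple(str(n) for n in ordered)

    keyed = frame.assign(_key=frame["file_names"].apply(to_key))
    deduped = keyed.drop_duplicates(subset="_key").drop(columns="_key")
    return deduped.reset_index(drop=True)


order_sensitive = dedup_by_file_names(df)
order_insensitive = dedup_by_file_names(df, sort_names=True)
```

Because each list holds only a few names, the per-row sort should add little cost; whether order should matter is a question about the data, not something this sketch can settle.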

Contributor

@lwhite1 lwhite1 left a comment

Made one comment, otherwise LGTM

@sid4py sid4py merged commit abb0916 into main Oct 31, 2023
@sid4py sid4py deleted the dedup_log_sensor_file_once branch October 31, 2023 14:53

2 participants