moving deduplication of log_sensor_file outside the loop #38

sid4py · 2023-10-30T23:32:14Z

The code was passing an instance of the log_sensor_file table to the write_file_into function inside dataflow. On each call, the function queried the table and ran the deduplication code. This happens 850 times nightly, and the number will grow as session folders grow. The table only needs to be queried and deduplicated once, which is the change implemented in this PR. FYI, function call times are:
Querying and dedup 10 times: 31.17 seconds
Querying once and dedup 10 times: 6.26 seconds
Querying once and dedup one time: 2.58 seconds
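
For readers skimming the thread, here is a minimal sketch of the pattern this PR describes: do the query and dedup once, then reuse the result inside the loop. It is not the code in this PR. `FAKE_TABLE`, `query_and_dedup`, `write_file_info`, and `process_folders` are hypothetical names, and an in-memory list stands in for the real table.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, Tuple[str, ...]]  # (session, file names)

# Hypothetical stand-in for the log_sensor_file table.
FAKE_TABLE: List[Row] = [
    ("s1", ("a.csv",)),
    ("s1", ("a.csv",)),  # duplicate row
    ("s2", ("b.csv", "c.csv")),
]


def query_and_dedup() -> List[Row]:
    """Simulate querying the table and dropping duplicate rows (first-seen order kept)."""
    return list(dict.fromkeys(FAKE_TABLE))


def write_file_info(folder: str, known_rows: List[Row]) -> bool:
    """Hypothetical per-folder step: report whether the folder already has a row."""
    return any(session == folder for session, _ in known_rows)


def process_folders(folders: List[str], load_rows: Callable[[], List[Row]]) -> Dict[str, bool]:
    # Query and deduplicate once, before the loop, then reuse the result
    # for every folder instead of repeating the work per call.
    known_rows = load_rows()
    return {folder: write_file_info(folder, known_rows) for folder in folders}
```

Passing the loader in as an argument makes it easy to confirm that it runs exactly once per batch, regardless of how many folders are processed.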

lwhite1 · 2023-10-31T13:30:39Z

scripts/dataflow_write_file_info.py

+        for i in x:
+            c = c+str(i)
+        return c
+    # Convert an array of filenames into a single concatenated string. If order of filename array


Could this be addressed by sorting here? Assuming the file name list isn't too long it might not be expensive.

sid4py · 2023-10-31

The table structure is such that the file names are a list in a single cell, i.e. the PostgreSQL datatype text[]. The list itself contains very few file names, often just one. For any sort operation on the file names, they would still need to be unpacked. I chose this route since the apply() and drop_duplicates() methods are efficiently implemented in pandas. I think sorting on another column and then selecting unique lists of file names would also work.
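
To make the discussion concrete, here is a rough sketch of deduplicating a list-valued column with apply() and drop_duplicates(). It is not the merged code: it folds in the sorting idea from the review comment, and the sample data, the column names, the `|` separator, and the helper names `filenames_key` and `dedup_by_filenames` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the log_sensor_file table;
# "file_names" mimics a PostgreSQL text[] column loaded as Python lists.
df = pd.DataFrame(
    {
        "session": ["s1", "s1", "s2"],
        "file_names": [["b.csv", "a.csv"], ["a.csv", "b.csv"], ["c.csv"]],
    }
)


def filenames_key(names):
    """Join a list of file names into one order-independent, hashable string."""
    return "|".join(sorted(names))


def dedup_by_filenames(frame: pd.DataFrame) -> pd.DataFrame:
    # Lists are unhashable, so drop_duplicates cannot compare them directly;
    # build a string key per row and deduplicate on that instead.
    keyed = frame.assign(_key=frame["file_names"].apply(filenames_key))
    return keyed.drop_duplicates(subset="_key").drop(columns="_key")


deduped = dedup_by_filenames(df)
```

Because each list is sorted before joining, rows that hold the same file names in a different order collapse to one row. With lists this short, the per-row sort should add little cost.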

lwhite1

Made one comment, otherwise LGTM

moving deduplication of log_sensor_file outside the loop

0905bdf

sid4py requested a review from lwhite1 October 30, 2023 23:32

lwhite1 reviewed Oct 31, 2023

lwhite1 approved these changes Oct 31, 2023

sid4py merged commit abb0916 into main Oct 31, 2023

sid4py deleted the dedup_log_sensor_file_once branch October 31, 2023 14:53
