Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash value comparison: new record is not loaded, if the same hash value exist in older record. #13

Open
nnikolov84 opened this issue Apr 17, 2024 · 4 comments

Comments

@nnikolov84
Copy link

When using hash value comparison, new record is not loaded, if the same hash value exist in older record.

Input data:

CUSTOMER_ID CUSTOMER_ADDRESS UPDATE_ON Comment
1 "ADDRESS_1" 1/1/2000 Initial address.
1 "ADDRESS_2" 1/20/2000 Customer changed address.
1 "ADDRESS_1" 1/22/2000 Customer changed back the address to the value from line 1.

The third value in the table will not be processed. This is happening, because when generating "ids_to_update", the historical records are not filtered.

with ids_to_update as 
   (select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`        
     except distinct         
   (select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`)

In order for the process to work properly, only most recent record per KEY should be compared to the input data.
Possible solution is:

with ids_to_update as 
   (select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`        
     except distinct         
   (select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`
      qualify row_number() over (partition by col1 order by updated_at desc) = 1)
nnikolov84 pushed a commit to nnikolov84/nnikolov-dataform-scd that referenced this issue Apr 17, 2024
Fix of: dataform-co#13
Fix of: new records are not loaded, if _updates table is empty.
@denziljoseph
Copy link

denziljoseph commented Jul 16, 2024

if your source table had unique records and it updated the record with new hash, will it be still an issue?

@nnikolov84
Copy link
Author

if your source table had unique records and it updated the record with new hash, will it be still an issue?

I'm not sure I understand what you mean.
The data coming from the source is unique, but off course it is changing from delivery to delivery. This being SCD, it should show how a record changed in time, but it fails to do so, if a value was already delivered in the past. *see the table my first comment.

@denziljoseph
Copy link

denziljoseph commented Jul 30, 2024

In addition to your input data you need to have an incremental model with customer_id as unique key. And this new incremental model should be the source to your scd.

@Liscor
Copy link

Liscor commented Sep 9, 2024

@nnikolov84 is correct. Take this example
That's the source table.
image

And this is the output after calling scd.
image
As you can see, the new row is not shown in the output table because the hash value from the last row is exactly the same as the hash from the second row.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants