
Batch loading sometimes missing records #188

Open
wants to merge 2 commits into master

Conversation

jeonguihyeong

  1. If sink tasks write GCS files in parallel, GCS filenames are currently duplicated.
    The first offset of the batch is now appended to the GCS filename to make it unique (see the sketch after this list).

  2. If sink tasks write GCS files into the same folder, one process can delete another table's GCS files.
    Files are now written into a separate folder per table.

  3. The task now waits for the BigQuery load job status.
    A log line is added with the job's succeeded row count.
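
A minimal naming sketch for points 1 and 2 above. The class, field names, and sample values are illustrative, not the connector's actual code:

```java
import java.time.Instant;

// Sketch only: shows how embedding the batch's first Kafka offset and a per-table
// folder in the GCS blob name avoids collisions when several sink tasks flush in parallel.
public class GcsBlobNaming {

  static String blobName(String topic, String taskUuid, long batchSize, long firstOffset) {
    // Folder per table/topic, so one task's cleanup cannot delete another table's files.
    String folder = topic;
    // A timestamp alone is not unique across parallel tasks; the first offset is,
    // because Kafka offsets are unique within a topic partition.
    return folder + "/" + topic + "_" + taskUuid + "_" + Instant.now().toEpochMilli()
        + "_" + batchSize + "_" + firstOffset;
  }

  public static void main(String[] args) {
    System.out.println(blobName("orders", "task-1", 500, 120_000L));
    System.out.println(blobName("orders", "task-2", 500, 120_500L)); // different offset -> different name
  }
}
```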

Update GCSToBQLoadRunnable.java
If sink tasks write GCS files in parallel, GCS filenames are currently duplicated.
The first offset of the batch is appended to the GCS filename to make it unique.

If sink tasks write GCS files into the same folder, one process can delete another table's GCS files.
Files are now written into a separate folder per table.

Wait for the BigQuery load job status.

Add a log line with the job's succeeded row count (a sketch follows below).
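
A minimal sketch of waiting for a GCS-to-BigQuery load job and logging its succeeded row count with the google-cloud-bigquery client; the dataset, table, and URI below are placeholders, and this is not the connector's exact code:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobStatistics.LoadStatistics;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sketch only: submit a load job from GCS, block until it finishes,
// then log how many rows succeeded.
public class WaitForLoadJobSketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("my_dataset", "my_table");             // placeholder
    String sourceUri = "gs://my-bucket/my_table/blob.json";           // placeholder

    LoadJobConfiguration config = LoadJobConfiguration.newBuilder(table, sourceUri)
        .setFormatOptions(FormatOptions.json())
        .build();

    Job job = bigQuery.create(JobInfo.of(config));
    Job completed = job.waitFor();                                    // wait for job status

    if (completed == null || completed.getStatus().getError() != null) {
      throw new RuntimeException("Load job failed: "
          + (completed == null ? "job no longer exists" : completed.getStatus().getError()));
    }
    LoadStatistics stats = completed.getStatistics();
    System.out.println("Load job succeeded, output rows: " + stats.getOutputRows());
  }
}
```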

Update BigQuerySinkTask.java
If sink tasks write GCS files in parallel, GCS filenames are currently duplicated.
The first offset of the batch is appended to the GCS filename to make it unique.

If sink tasks write GCS files into the same folder, one process can delete another table's GCS files.
Files are now written into a separate folder per table.

@jeonguihyeong jeonguihyeong requested a review from a team as a code owner April 6, 2022 05:40
@raphaelauv

Hey @jeonguihyeong, could you reformulate your PR text? It's very complicated to understand your message. Thanks

@@ -247,7 +249,8 @@ public void put(Collection<SinkRecord> records) {
       TableWriterBuilder tableWriterBuilder;
       if (config.getList(BigQuerySinkConfig.ENABLE_BATCH_CONFIG).contains(record.topic())) {
         String topic = record.topic();
-        String gcsBlobName = topic + "_" + uuid + "_" + Instant.now().toEpochMilli();
+        long offset = record.kafkaOffset();
+        String gcsBlobName = topic + "_" + uuid + "_" + Instant.now().toEpochMilli()+"_"+records.size()+"_"+offset;

Would having a test case for validating that parallel puts create different files with the right offset help?

Author

@jeonguihyeong jeonguihyeong Apr 13, 2022


GCP side

  • You can enable GCS object versioning and check how many times each GCS object name is written.

BigQuery sink

  • The concept comes from the S3 sink connector; I converted it to the BigQuery sink.
    It uses the first offset of the records, and filenames are not duplicated because Kafka message offsets are unique.

If you need more details, contact me.
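
In the spirit of the reviewer's question above, a hypothetical JUnit-style sketch of such a test; the naming helper is illustrative, not the connector's actual method:

```java
import static org.junit.jupiter.api.Assertions.assertNotEquals;

import org.junit.jupiter.api.Test;

// Hypothetical test sketch: even if two flushes of the same topic share a timestamp,
// their first offsets differ, so the resulting blob names differ.
public class GcsBlobNameUniquenessTest {

  private String blobName(String topic, String uuid, long timestampMillis, int batchSize, long firstOffset) {
    return topic + "/" + topic + "_" + uuid + "_" + timestampMillis + "_" + batchSize + "_" + firstOffset;
  }

  @Test
  public void parallelBatchesWithDifferentFirstOffsetsGetDifferentNames() {
    long sameTimestamp = 1_649_221_200_000L;   // identical timestamp on purpose
    String first = blobName("orders", "uuid-a", sameTimestamp, 500, 120_000L);
    String second = blobName("orders", "uuid-a", sameTimestamp, 500, 120_500L);
    assertNotEquals(first, second);            // offsets are unique within a partition
  }
}
```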

@b-goyal
Member

b-goyal commented Dec 7, 2022

We have not been able to reproduce the missing-records issue on our end.
@jeonguihyeong, since you mention you observed GCS filename duplication, could you help us with steps to reproduce the 'GCS filename duplication', please?

CC: @kapilchhajer @binoy-fernandez
