
clickhouse-backup: Add support for sharded backup #648

Merged: 1 commit merged into Altinity:master on Aug 3, 2023

Conversation

@mskwon (Contributor) commented Apr 26, 2023

This change adds a new configuration option, 'general.sharded_operation', which shards tables for backup across replicas, allowing for a uniform backup and restore call to the server regardless of table replication state.

Fixes #639
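
For context on the approach, here is a minimal, hypothetical Go sketch of the hash-mod assignment idea described in #639 (the names are illustrative, not the PR's actual code): each replica hashes the fully qualified table name over the sorted list of active replicas and takes a full backup only of the tables that map to itself, backing everything else up schema-only. This is what produces the mixed full / schema-only listings shown in a later comment.

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// assignFullBackup reports whether `replica` should take the full backup of
// db.table; every other replica would back up the schema only.
// Hypothetical sketch of the hash-mod idea, not the PR's actual code.
func assignFullBackup(replica, db, table string, activeReplicas []string) bool {
	replicas := append([]string(nil), activeReplicas...)
	sort.Strings(replicas) // same deterministic order on every replica
	h := fnv.New32a()
	h.Write([]byte(db + "." + table))
	idx := int(h.Sum32() % uint32(len(replicas)))
	return replicas[idx] == replica
}

func main() {
	replicas := []string{"clickhouse-0", "clickhouse-1", "clickhouse-2"}
	for _, table := range []string{"data", "nodelete"} {
		for _, r := range replicas {
			fmt.Printf("%-12s mkwontest.%-8s full=%v\n",
				r, table, assignFullBackup(r, "mkwontest", table, replicas))
		}
	}
}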

@Slach added this to the 2.3.0 milestone on Apr 26, 2023
@mskwon force-pushed the sharded-backup branch 4 times, most recently from 1984421 to 82d747d on May 1, 2023 23:34
@mskwon (Contributor, Author) commented May 2, 2023

My bad for pushing this without actually checking things out - I've made some amendments and done some testing:

clickhouse-0:~$ clickhouse-backup tables
mkwontest.data         160B  default  full
mkwontest.nodelete     84B   default  schema-only
mkwontest_atomic.data  82B   default  schema-only

clickhouse-1:~$ clickhouse-backup tables
mkwontest.data         160B  default  schema-only
mkwontest.nodelete     84B   default  full
mkwontest_atomic.data  82B   default  schema-only

clickhouse-2:~$ clickhouse-backup tables
mkwontest.data         160B  default  schema-only
mkwontest.nodelete     84B   default  schema-only
mkwontest_atomic.data  82B   default  full

I hope you don't mind that I didn't add an integration test, as I noticed that the test setup is single-replica (?)

@Slach (Collaborator) commented May 2, 2023

This is a massive commit; I need time to figure out how exactly it works.

Review threads on pkg/backup/backup_shard.go, pkg/backup/backup_shard_test.go, pkg/backup/version.go, and ReadMe.md (all since resolved).
@Slach (Collaborator) commented May 12, 2023

Please notify me if you don't have time to resolve conflicts and make the pull request changes.

@Slach (Collaborator) commented May 15, 2023

@mskwon any updates?

@mskwon (Contributor, Author) commented May 15, 2023

May I please have a little more time? I've started the changes, though I'd like to add some different sharding modes. I've realized that the consistency issue you pointed out could be pretty bad - for a hypothetical star schema, this could lead to the dimension and fact tables becoming inconsistent, especially as transactions are still experimental in ClickHouse, IIRC. I'd like to make it possible to shard by database, or to have an option to keep certain tables backed up by the same replica.
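
To illustrate the granularity being discussed here, a small hypothetical Go sketch (names are illustrative): table mode hashes each database.table independently, while database mode hashes only the database name, so every table of one database, e.g. the fact and dimension tables of a star schema kept in a single database, ends up backed up by the same replica.

package main

import "fmt"

// shardKey returns the string that would be hashed onto a replica for a given
// table, depending on the sharding granularity. Hypothetical names; the
// "first-replica" mode added later in the merged version bypasses hashing entirely.
func shardKey(mode, database, table string) string {
	switch mode {
	case "database":
		// every table of a database lands on the same replica, so related
		// tables kept in one database stay together in a single backup
		return database
	case "table":
		// tables are spread across replicas independently of each other
		return database + "." + table
	default:
		// "none": no sharding, every replica backs up everything
		return ""
	}
}

func main() {
	fmt.Println(shardKey("table", "star", "fact_sales"))     // star.fact_sales
	fmt.Println(shardKey("database", "star", "dim_product")) // star
}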

@Slach (Collaborator) commented May 15, 2023

@mskwon great, glad to hear it.
What about just detecting the "first" replica in a shard, without sharding by table or database name?
Assuming that the data is consistent within a single replica - of course I don't mean full "consistency", since ClickHouse doesn't support full ACID - would your option allow a sequence similar to the one described here: https://github.com/Altinity/clickhouse-backup/blob/master/Examples.md#how-to-make-backup--restore-sharded-cluster?

I mean: when executing a backup, back up data only on the first replica in the shard; when executing a restore, restore data only on the first replica in the shard but restore the schema on all replicas.
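
A sketch of that rule, assuming the tool already knows the current replica name and the shard's active replicas (illustrative Go, not the PR's implementation):

package main

import (
	"fmt"
	"sort"
)

// isFirstActiveReplica reports whether `current` is the lexicographically first
// active replica of the shard, i.e. the one that would back up and restore
// table data; schema operations would still run on every replica.
func isFirstActiveReplica(current string, activeReplicas []string) bool {
	if len(activeReplicas) == 0 {
		return false
	}
	replicas := append([]string(nil), activeReplicas...)
	sort.Strings(replicas)
	return replicas[0] == current
}

func main() {
	active := []string{"clickhouse-2", "clickhouse-0", "clickhouse-1"}
	fmt.Println(isFirstActiveReplica("clickhouse-0", active)) // true
	fmt.Println(isFirstActiveReplica("clickhouse-1", active)) // false
}

Restore would mirror this: the schema is restored everywhere, and data only where this check returns true.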

@Slach (Collaborator) commented Jun 11, 2023

@mskwon any news about this PR?

@Slach modified the milestones: 2.3.0, 2.4.0 on Jun 11, 2023
@Slach (Collaborator) commented Jun 14, 2023

Oh, we rolled back some zerolog-related commits in master =(

@Slach (Collaborator) commented Jun 14, 2023

Looks like you tried to rebase. Could you rebase again onto a fresh master?

@mskwon (Contributor, Author) commented Jun 14, 2023

Would it be possible to rerun the CI? I'm seeing a data race in the testflows tests related to the server, which I'm guessing will be fixed by the zerolog changes.

@Slach (Collaborator) commented Jun 15, 2023

I rolled back the zerolog migration (git reset --hard, git push --force origin master) and moved those commits to a separate branch and a separate WIP PR:
#670

Could you rebase your commits? We don't need 11 unnecessary zerolog-related commits in your PR.

@Slach (Collaborator) commented Jun 15, 2023

The pull request still shows 11 commits instead of 2. Could you rebase your commits?
https://chat.openai.com/share/02204c18-0061-4a42-93da-2fdbde75a854

@mskwon (Contributor, Author) commented Jun 23, 2023

Apologies for the poor responsiveness here; I was taking some time off.

It seems like the recent tests failed due to some S3 state

error one of upload table go-routine return error: can't upload: operation error S3: PutObject, serialization failed: serialization failed: input member Bucket must not be empty

@Slach (Collaborator) commented Jul 8, 2023

So, will we continue the work on this PR?

@mskwon (Contributor, Author) commented Jul 10, 2023

Hey, it looks like what might be happening is that the clickhouse/clickhouse-server:23.3 image is based on Ubuntu Focal, which doesn't have bsdextrautils. I'm not sure what a suitable replacement would be, though I'll try to fix this issue in a separate PR.

@mskwon (Contributor, Author) commented Jul 10, 2023

Hmmm, I rebased this off of your fix for bsdextrautils, but it seems like I'm still running into some problems with TestFIPS

2023/07/10 17:31:26.326098 error one of upload table go-routine return error: can't upload: operation error S3: PutObject, serialization failed: serialization failed: input member Bucket must not be empty

Running the following locally seems to have TestFIPS passing

% CLICKHOUSE_VERSION=20.3 RUN_TESTS=TestFIPS ./test/integration/run.sh

I'm not sure what is going on with my local setup, but running the following produces an error

+ cp -fl ./clickhouse-backup/clickhouse-backup-race-fips ./clickhouse-backup/clickhouse-backup-race-fips-docker
cp: ./clickhouse-backup/clickhouse-backup-race-fips-docker: Bad file descriptor
+ docker-compose -f /Users/mkwon/go/src/github.com/Altinity/clickhouse-backup/test/integration/docker-compose_advanced.yml up -d
[+] Running 11/11
 ✔ Network integration_clickhouse-backup       Created                                                                                                     0.0s 
 ✔ Network integration_default                 Created                                                                                                     0.0s 
 ✔ Container ftp                               Started                                                                                                     0.9s 
 ✔ Container pgsql                             Healthy                                                                                                    31.2s 
 ✔ Container sshd                              Started                                                                                                     0.8s 
 ✔ Container mysql                             Healthy                                                                                                    30.7s 
 ✔ Container minio                             Started                                                                                                     1.1s 
 ✔ Container zookeeper                         Healthy                                                                                                     4.2s 
 ✔ Container azure                             Started                                                                                                     1.1s 
 ✘ Container clickhouse                        Error                                                                                                      33.0s 
 ✔ Container integration-all_services_ready-1  Created                                                                                                     0.0s 
dependency failed to start: container clickhouse exited (126)

Still trying to debug...

@mskwon (Contributor, Author) commented Jul 24, 2023

Hmmm, seeing a different error now with your fixes:

clickhouse            | 2023.07.24 17:17:14.190807 [ 1 ] {} <Error> Application: Code: 36. DB::Exception: Bucket name length is out of bounds in virtual hosted style S3 URI: '' (https://storage.googleapis.com//clickhouse_backup_disk_gcs_over_s3/4d1142ff0bc3/): Cannot attach table `system`.`trace_log` from metadata file /var/lib/clickhouse/store/c45/c452789d-5a7b-4a44-9d33-5f79df5294af/trace_log.sql from query ATTACH TABLE system.trace_log UUID '58f6fae0-0c8c-446e-8238-4e0e14b5a814' (`event_date` Date, `event_time` DateTime, `event_time_microseconds` DateTime64(6), `timestamp_ns` UInt64, `revision` UInt32, `trace_type` Enum8('Real' = 0, 'CPU' = 1, 'Memory' = 2, 'MemorySample' = 3, 'MemoryPeak' = 4, 'ProfileEvent' = 5), `thread_id` UInt64, `query_id` String, `trace` Array(UInt64), `size` Int64, `event` LowCardinality(String), `increment` Int64) ENGINE = MergeTree PARTITION BY toYYYYMM(event_date) ORDER BY (event_date, event_time) SETTINGS index_granularity = 8192. (BAD_ARGUMENTS), Stack trace (when copying this message, always include the lines below):

@Slach (Collaborator) commented Jul 24, 2023

You need to create a GCS bucket in https://console.cloud.google.com
and create HMAC credentials in Settings -> Interoperability
(https://console.cloud.google.com/storage/settings;tab=interoperability),

and then define the following 3 secrets in your fork:

  • QA_GCS_OVER_S3_ACCESS_KEY
  • QA_GCS_OVER_S3_SECRET_KEY
  • QA_GCS_OVER_S3_BUCKET

@mskwon (Contributor, Author) commented Jul 24, 2023

Ah - my bad for not specifying, but this is for the GitHub Actions. It seems like QA_GCS_OVER_S3_BUCKET in the continuous integration .env might be hitting an issue.

@Slach (Collaborator) commented Jul 25, 2023

@mskwon try merging the latest master; after that your test should pass: 44f2d13

@mskwon (Contributor, Author) commented Jul 25, 2023

Hmmm, seems like the S3 bucket error came up again with the CI on TestFIPS:

2023/07/25 20:39:25.954610  warn BackupList bd.Walk return error: operation error S3: ListObjectsV2, serialization failed: serialization failed: input member Bucket must not be empty logger=s3
2023/07/25 20:40:55.970905  info clickhouse connection closed logger=clickhouse
2023/07/25 20:40:55.971011 error one of upload table go-routine return error: can't upload: operation error S3: PutObject, serialization failed: serialization failed: input member Bucket must not be empty

It's strange because the CI seems to be passing for your changes, but I don't think this change impacts any of the tests unless there is some matrix of configurations being tested that turns on the sharded operation mode. Does the test infrastructure test sharded operation modes even without modification to the tests?

@Slach (Collaborator) commented Jul 26, 2023

Create ./test/integration/.env (look at env.example), try to run locally:
S3_DEBUG=true RUN_TESTS=TestFIPS ./test/integration/run.sh
and share the output.

@Slach (Collaborator) commented Jul 26, 2023

cp: ./clickhouse-backup/clickhouse-backup-race-fips-docker: Bad file descriptor
looks weird.
Could you run
make clean build-race-docker build-race-fips-docker
manually and ensure the file is present?

What does ls -la ./clickhouse-backup/ show?

@Slach (Collaborator) commented Jul 26, 2023

also please merge latest master

@mskwon (Contributor, Author) commented Jul 31, 2023

Quoting @Slach:
cp: ./clickhouse-backup/clickhouse-backup-race-fips-docker: Bad file descriptor
looks weird.
Could you run make clean build-race-docker build-race-fips-docker manually and ensure the file is present?
What does ls -la ./clickhouse-backup/ show?

clickhouse-backup % make clean build-race-docker build-race-fips-docker

... 

#18 [builder-race 7/7] COPY entrypoint.sh /entrypoint.sh
#18 DONE 0.0s

#19 [make-build-race-fips 1/1] COPY --from=builder-race /src/clickhouse-backup/ /src/clickhouse-backup/
#19 DONE 0.7s

#20 exporting to image
#20 exporting layers
#20 exporting layers 1.6s done
#20 writing image sha256:8b18c666e99467171c4f310b4197226f10ccd53553dcdc6e0655c8ccfa71d63c done
#20 naming to docker.io/library/clickhouse-backup:build-race-fips done
#20 DONE 1.6s
+ mkdir -pv ./clickhouse-backup
++ docker create clickhouse-backup:build-race-fips
+ DOCKER_ID=0ce30b0e2a3076831fe05350c872625a66d65cead24eb55fb7110a0041c6f5d1
+ docker cp -q 0ce30b0e2a3076831fe05350c872625a66d65cead24eb55fb7110a0041c6f5d1:/src/clickhouse-backup/clickhouse-backup-race-fips ./clickhouse-backup/
+ docker rm -f 0ce30b0e2a3076831fe05350c872625a66d65cead24eb55fb7110a0041c6f5d1
0ce30b0e2a3076831fe05350c872625a66d65cead24eb55fb7110a0041c6f5d1
+ cp -fl ./clickhouse-backup/clickhouse-backup-race-fips ./clickhouse-backup/clickhouse-backup-race-fips-docker
cp: ./clickhouse-backup/clickhouse-backup-race-fips-docker: Bad file descriptor
clickhouse-backup % ls -la
total 512
drwxr-xr-x  23 mkwon  staff    736 Jul 31 15:34 .
drwxr-xr-x   3 mkwon  staff     96 Jun 14 09:23 ..
-rw-r--r--   1 mkwon  staff     24 Jun 27 10:39 .dockerignore
drwxr-xr-x  16 mkwon  staff    512 Jul 31 15:23 .git
drwxr-xr-x   3 mkwon  staff     96 Apr 14 12:10 .github
-rw-r--r--   1 mkwon  staff    187 Jun 27 10:39 .gitignore
-rw-r--r--   1 mkwon  staff  36424 Jul 31 14:45 ChangeLog.md
-rw-r--r--   1 mkwon  staff   6748 Jul 31 14:45 Dockerfile
-rw-r--r--   1 mkwon  staff  31741 Jul 31 14:45 Examples.md
-rw-r--r--   1 mkwon  staff   1166 Apr 14 12:10 LICENSE
-rw-r--r--   1 mkwon  staff   7590 Jul 24 10:01 Makefile
-rw-r--r--   1 mkwon  staff  18980 Jul 24 10:01 Manual.md
-rw-r--r--   1 mkwon  staff  52934 Jul 31 14:45 ReadMe.md
-rw-r--r--   1 mkwon  staff   6244 Jun 27 10:39 Vagrantfile
drwxr-xr-x   6 mkwon  staff    192 Jul 31 15:36 clickhouse-backup
drwxr-xr-x   3 mkwon  staff     96 Apr 14 12:10 cmd
-rw-r--r--   1 mkwon  staff    168 Apr 14 12:10 entrypoint.sh
-rw-r--r--   1 mkwon  staff    454 Apr 14 12:10 generate_manual.sh
-rw-r--r--   1 mkwon  staff   6159 Jun 27 10:39 go.mod
-rw-r--r--   1 mkwon  staff  64933 Jul 24 10:00 go.sum
drwxr-xr-x  18 mkwon  staff    576 Jun 23 10:44 pkg
drwxr-xr-x   5 mkwon  staff    160 Apr 14 12:10 test
drwxr-xr-x  11 mkwon  staff    352 Jun 23 10:54 vendor

@Slach (Collaborator) commented Aug 1, 2023

cp -fl ./clickhouse-backup/clickhouse-backup-race-fips ./clickhouse-backup/clickhouse-backup-race-fips-docker
cp: ./clickhouse-backup/clickhouse-backup-race-fips-docker: Bad file descriptor

Which OS do you use?

What does ls -la ./clickhouse-backup/ show?

Could you merge the latest master into your branch? Did you force-push after git pull --rebase?

@mskwon (Contributor, Author) commented Aug 1, 2023

I'm using macOS.

clickhouse-backup % ls -la ./clickhouse-backup/
total 496432
drwxr-xr-x   6 mkwon  staff       192 Jul 31 15:36 .
drwxr-xr-x  23 mkwon  staff       736 Jul 31 15:34 ..
-rwxr-xr-x   2 mkwon  staff  62368304 Jul 31 15:32 clickhouse-backup-race
-rwxr-xr-x   2 mkwon  staff  62368304 Jul 31 15:32 clickhouse-backup-race-docker
-rwxr-xr-x   2 mkwon  staff  64716064 Jul 31 15:36 clickhouse-backup-race-fips
-rwxr-xr-x   2 mkwon  staff  64716064 Jul 31 15:36 clickhouse-backup-race-fips-docker

The latest push should be on top of Altinity:master, at least according to this comparison:
master...mskwon:clickhouse-backup:sharded-backup

@Slach (Collaborator) commented Aug 1, 2023

Temporarily replace cp -fl with cp -fv in the Makefile.

@Slach (Collaborator) commented Aug 2, 2023

Please merge the latest master changes (TestFIPS should be skipped for your fork) and properly resolve the conflicts.

This change adds a new configuration 'general.sharded_operation'
which shards tables for backup across replicas, allowing for a
uniform backup and restore call to the server without consideration
for table replication state.

Fixes Altinity#639

clickhouse-backup/backup_shard: Use Array for active replicas

clickhouse-go v1 does not support clickhouse Map types. Force the
Map(String, UInt8) column replica_is_active to a string array for
now.

clickhouse-backup/backuper: Skip shard assignment for skipped tables

Skip shard assignment for skipped tables. Also add the new
ShardBackupType "ShardBackupNone", which is assigned to skipped
tables

clickhouse-backup/backuper: Use b.GetTables for CreateBackup

Use b.GetTables for CreateBackup instead of b.ch.GetTables and move
b.populateBackupShardField to b.GetTables so as to populate the
field for the server API.

backup: Addressing changes for adding sharding support

Add in different sharded operation modes to give users the ability
to specify granularity of sharding
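
As a rough illustration of the ShardBackupNone behaviour described in the commit message above (ShardBackupNone is named there; the other names in this Go sketch are assumptions, not the PR's actual code):

package main

import "fmt"

// ShardBackupType follows the idea in the commit message above; only
// ShardBackupNone is named there, the other constants are assumptions.
type ShardBackupType string

const (
	ShardBackupFull   ShardBackupType = "full"
	ShardBackupSchema ShardBackupType = "schema-only"
	ShardBackupNone   ShardBackupType = "none" // table is skipped entirely
)

// shardAssignment decides what this replica does with a single table:
// skipped tables never receive a shard assignment at all.
func shardAssignment(skipped, assignedToThisReplica bool) ShardBackupType {
	switch {
	case skipped:
		return ShardBackupNone
	case assignedToThisReplica:
		return ShardBackupFull
	default:
		return ShardBackupSchema
	}
}

func main() {
	fmt.Println(shardAssignment(true, true))   // none
	fmt.Println(shardAssignment(false, true))  // full
	fmt.Println(shardAssignment(false, false)) // schema-only
}
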
@coveralls commented Aug 2, 2023

Pull Request Test Coverage Report for Build 5743982693

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 155 of 168 (92.26%) changed or added relevant lines in 5 files are covered.
  • 238 unchanged lines in 9 files lost coverage.
  • Overall coverage decreased (-1.7%) to 64.343%

Changes missing coverage (covered lines / changed or added lines, %):
  • pkg/backup/create.go: 5 of 7 (71.43%)
  • pkg/backup/list.go: 2 of 4 (50.0%)
  • pkg/clickhouse/version.go: 9 of 12 (75.0%)
  • pkg/backup/backuper.go: 48 of 54 (88.89%)

Files with coverage reduction (new missed lines, resulting %):
  • pkg/backup/restore.go: 2 (67.41%)
  • pkg/metadata/load.go: 2 (81.82%)
  • pkg/storage/object_disk/object_disk.go: 5 (67.6%)
  • pkg/config/config.go: 10 (69.97%)
  • pkg/status/status.go: 12 (70.2%)
  • pkg/server/server.go: 23 (52.15%)
  • pkg/storage/general.go: 24 (58.15%)
  • pkg/storage/s3.go: 46 (59.44%)
  • pkg/storage/gcs.go: 114 (0.0%)

Totals coverage status:
  • Change from base Build 5742072024: -1.7%
  • Covered Lines: 7348
  • Relevant Lines: 11420

💛 - Coveralls

@Slach merged commit 7ce6103 into Altinity:master on Aug 3, 2023
@Slach (Collaborator) commented Aug 3, 2023

thanks for contributing

Slach added a commit that referenced this pull request Aug 3, 2023
added to changelog:

Add `SHARDED_OPERATION_MODE` option to easily create backups for a sharded cluster; available values: none (no sharding), table (table granularity), database (database granularity), first-replica (on the lexicographically sorted first active replica). Thanks @mskwon, fix [639](#639), fix [648](#648)
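
For reference, a minimal Go sketch of validating the documented values before use (a hypothetical helper, not the project's actual config parser):

package main

import (
	"fmt"
	"strings"
)

// parseShardedOperationMode validates the values documented in the changelog
// entry above; an empty value means sharded operation is disabled.
func parseShardedOperationMode(value string) (string, error) {
	switch m := strings.ToLower(strings.TrimSpace(value)); m {
	case "", "none", "table", "database", "first-replica":
		return m, nil
	default:
		return "", fmt.Errorf("unknown sharded operation mode %q", value)
	}
}

func main() {
	mode, err := parseShardedOperationMode("first-replica")
	fmt.Println(mode, err) // first-replica <nil>
}
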
@mskwon (Contributor, Author) commented Aug 3, 2023

Thank you for your patience and the huge amount of help with this, Slach
