RisingWave recovery gets stuck if a Hummock file is absent #19865
It also happened to me today with RisingWave 2.1.0 deployed through the Kubernetes operator (0.8.3). I'm currently using Hetzner Object Storage.
Hi @marceloneppel,
Hi, @zwang28! Thanks for the support. I have attached the requested information.

Log:

2024-12-30T10:54:53.767016944Z ERROR risingwave_object_store::object: read failed error=NotFound (permanent) at read, context: { uri: https://neppel-prod-database.fsn1.your-objectstorage.com/risingwave/119/177975.data, response: Parts { status: 404, version: HTTP/1.1, headers: {"content-length": "273", "x-amz-request-id": "tx0000075af18d11e51e05e-0067727bfd-8b2eee4-fsn1-prod1-ceph3", "accept-ranges": "bytes", "content-type": "application/xml", "date": "Mon, 30 Dec 2024 10:54:53 GMT", "x-debug-backend": "fsn1-prod1-ceph3", "strict-transport-security": "max-age=63072000", "x-debug-bucket": "neppel-prod-database"} }, service: s3, path: risingwave/119/177975.data, range: 33262700-33328157 } => S3Error { code: "NoSuchKey", message: "", resource: "", request_id: "tx0000075af18d11e51e05e-0067727bfd-8b2eee4-fsn1-prod1-ceph3" }

Response from the query:

prod=> select sstable_id,object_id,compaction_group_id,level_id,sub_level_id,level_type,right_exclusive,file_size,meta_offset,stale_key_count,total_key_count,min_epoch,max_epoch,uncompressed_file_size,range_tombstone_count,bloom_filter_kind,table_ids from rw_hummock_sstables where object_id=177975;
 sstable_id | object_id | compaction_group_id | level_id | sub_level_id | level_type | right_exclusive | file_size | meta_offset | stale_key_count | total_key_count | min_epoch | max_epoch | uncompressed_file_size | range_tombstone_count | bloom_filter_kind | table_ids
------------+-----------+---------------------+----------+--------------+------------+-----------------+-----------+-------------+-----------------+-----------------+------------------+------------------+------------------------+-----------------------+-------------------+-------------------------------------
 178023 | 177975 | 2 | 0 | 18158 | 2 | f | 44595725 | 44340751 | 0 | 247603 | 7733539191324672 | 7733539191324672 | 44589578 | 0 | 1 | [104, 105, 106, 107, 108, 109, 110]
(1 row)

Range: 33262700-33328157
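For reference, a quick way to confirm whether the reported object is really missing from object storage is to stat it directly. The following is a minimal sketch, not part of the original report, assuming the same endpoint, bucket, and key as in the error log above:

```bash
# Hedged sketch: stat the object that RisingWave reports as missing.
# Endpoint, bucket, and key are copied from the error log above; adjust as needed.
aws s3api head-object \
  --endpoint-url=https://fsn1.your-objectstorage.com \
  --bucket neppel-prod-database \
  --key risingwave/119/177975.data
# A 404 / "Not Found" response here confirms the object is gone from the bucket,
# matching the NoSuchKey error in the RisingWave log.
```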
Can you check whether the object expiration lifecycle policy is set on the bucket used by RisingWave?
If you have the logs for the meta node and compactor, or the issue is reproduced, please also search for the affected object ID (in this example it is 177975) in the meta and compactor logs and share the relevant log lines.
Sure. There is no lifecycle policy set in the bucket:

➜ ~ aws s3api get-bucket-lifecycle-configuration --endpoint-url=https://fsn1.your-objectstorage.com --bucket neppel-prod-database --debug
...
2024-12-30 14:02:16,544 - MainThread - botocore.parsers - DEBUG - Response body:
b'<?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchLifecycleConfiguration</Code><Message></Message><BucketName>neppel-prod-database</BucketName><RequestId>tx0000031b25906a11cc5a6-00677299d8-93eda0e-fsn1-prod1-ceph3</RequestId><HostId>93eda0e-fsn1-prod1-ceph3-fsn1-prod1</HostId></Error>'
...
Only the compactor has logs for that object ID (no meta logs mentioned it):

kubectl logs -n prod risingwave-compactor-9468d469-6sfnn --tail=100 | grep 177975
2024-12-31T02:05:00.537287941Z WARN opendal::services: service=s3 name=neppel-prod-database path=risingwave/119/177975.data: stat failed NotFound (permanent) at stat, context: { uri: https://neppel-prod-database.fsn1.your-objectstorage.com/risingwave/119/177975.data, response: Parts { status: 404, version: HTTP/1.1, headers: {"content-length": "273", "x-amz-request-id": "tx000000554b65963cc87d7-006773514c-9266d8c-fsn1-prod1-ceph3", "accept-ranges": "bytes", "content-type": "application/xml", "date": "Tue, 31 Dec 2024 02:05:00 GMT", "x-debug-backend": "fsn1-prod1-ceph3", "strict-transport-security": "max-age=63072000", "x-debug-bucket": "neppel-prod-database"} }, service: s3, path: risingwave/119/177975.data }
2024-12-31T02:05:00.537480289Z ERROR risingwave_object_store::object: read failed error=NotFound (permanent) at stat, context: { uri: https://neppel-prod-database.fsn1.your-objectstorage.com/risingwave/119/177975.data, response: Parts { status: 404, version: HTTP/1.1, headers: {"content-length": "273", "x-amz-request-id": "tx000000554b65963cc87d7-006773514c-9266d8c-fsn1-prod1-ceph3", "accept-ranges": "bytes", "content-type": "application/xml", "date": "Tue, 31 Dec 2024 02:05:00 GMT", "x-debug-backend": "fsn1-prod1-ceph3", "strict-transport-security": "max-age=63072000", "x-debug-bucket": "neppel-prod-database"} }, service: s3, path: risingwave/119/177975.data }
Level 0 ["[id: 178023, obj_id: 177975 object_size 43550KB sst_size 42283KB stale_ratio 0]"]
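As a side note on the log search above, a single loop can check every pod in the namespace (meta, compute, compactor) for the missing object ID in one pass. This is only a sketch, not from the thread; the namespace `prod` and the object ID 177975 come from this report, everything else is an assumption:

```bash
# Hedged sketch: grep all pods in the namespace for the missing object ID.
OBJECT_ID=177975
for pod in $(kubectl get pods -n prod -o name); do
  echo "== ${pod}"
  kubectl logs -n prod "${pod}" --all-containers 2>/dev/null | grep "${OBJECT_ID}" || true
done
```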
Hi @marceloneppel, we've reached out to you on Slack for quicker communication. Please check it when you have a moment.
Hello, it looks like I am affected by the same issue. All components are looking for some If I delete the corresponding rows in
Hi @maingoh, we've reached out to you on Slack for quicker communication. Please check it when you have a moment.
It seems my issue was a bit different from the initial issue, but I will still give my resolution in case it happens to someone else. As I was playing with the
Thank you @zwang28 for the debugging!!
For anyone experiencing the
Describe the bug
The issue appeared yesterday: writing a barrier failed due to a storage 404 error.
The system looks to be running but is not processing any new data.
The logs are full of repeated 404 errors with no way to stop them.
In Grafana, barrier completion time and the failed-recovery count keep increasing.
After restarting the meta and compute nodes, the cluster stays stuck with SELECT rw_recovery_status() = 'STARTING'.
I'm looking for a way to skip or fix this; I don't mind recreating state from the upstream topics.
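For context, the stuck state described above can be polled from the SQL frontend. Below is a minimal sketch, not part of the original report, assuming the usual RisingWave Docker Compose connection defaults (host localhost, port 4566, database dev, user root):

```bash
# Hedged sketch: poll rw_recovery_status() until it leaves 'STARTING'.
# Connection parameters are assumed defaults; adjust them to your deployment.
while true; do
  status=$(psql -h localhost -p 4566 -d dev -U root -t -A -c "SELECT rw_recovery_status();")
  echo "$(date -u +%FT%TZ) recovery status: ${status}"
  [ "${status}" != "STARTING" ] && break
  sleep 10
done
```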
Error message/log
To Reproduce
It looks to be a transient storage issue that is non-recoverable.
Expected behavior
Expected the system to error out and give up trying to recover rather than being stuck in a loop for a day.
How did you deploy RisingWave?
Docker compose
The version of RisingWave
PostgreSQL 13.14.0-RisingWave-2.0.1 (0d15632)
Additional context
No response