[Bug]: [benchmark][cluster] dql request timeout in concurrent dql & multi-partition scene #38275

Closed
1 task done
wangting0128 opened this issue Dec 6, 2024 · 6 comments
Labels
kind/bug Issues or changes related a bug test/benchmark benchmark test triage/accepted Indicates an issue or PR is ready to be actively worked on.

@wangting0128
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20241206-d7a5ad4e-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):2.5.0rc124
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-concurrent-not-found
test case name: test_hybrid_search_locust_dql_dml_partition_hybrid_search_cluster

server:

NAME                                                              READY   STATUS      RESTARTS       AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-concurrent-not-found-1-etcd-0                             1/1     Running     0              3h7m    10.104.33.7     4am-node36   <none>           <none>
fouramf-concurrent-not-found-1-etcd-1                             1/1     Running     0              3h7m    10.104.21.88    4am-node24   <none>           <none>
fouramf-concurrent-not-found-1-etcd-2                             1/1     Running     0              3h7m    10.104.32.197   4am-node39   <none>           <none>
fouramf-concurrent-not-found-1-milvus-datanode-7f6dffd5dd-7zd2r   1/1     Running     2 (3h6m ago)   3h7m    10.104.27.186   4am-node31   <none>           <none>
fouramf-concurrent-not-found-1-milvus-indexnode-59bbdf4b5-2rh5n   1/1     Running     1 (3h6m ago)   3h7m    10.104.30.142   4am-node38   <none>           <none>
fouramf-concurrent-not-found-1-milvus-indexnode-59bbdf4b5-fxrfq   1/1     Running     2 (3h6m ago)   3h7m    10.104.16.55    4am-node21   <none>           <none>
fouramf-concurrent-not-found-1-milvus-indexnode-59bbdf4b5-kcvf7   1/1     Running     2 (3h6m ago)   3h7m    10.104.27.188   4am-node31   <none>           <none>
fouramf-concurrent-not-found-1-milvus-indexnode-59bbdf4b5-kpfxf   1/1     Running     1 (3h6m ago)   3h7m    10.104.14.222   4am-node18   <none>           <none>
fouramf-concurrent-not-found-1-milvus-mixcoord-6664575c86-wkp7p   1/1     Running     2 (3h7m ago)   3h7m    10.104.27.187   4am-node31   <none>           <none>
fouramf-concurrent-not-found-1-milvus-proxy-7f579d96b6-zt9l8      1/1     Running     2 (3h6m ago)   3h7m    10.104.17.223   4am-node23   <none>           <none>
fouramf-concurrent-not-found-1-milvus-querynode-f4fc4dc45-hjrt8   1/1     Running     2 (3h6m ago)   3h7m    10.104.17.225   4am-node23   <none>           <none>
fouramf-concurrent-not-found-1-milvus-querynode-f4fc4dc45-hmh86   1/1     Running     2 (3h6m ago)   3h7m    10.104.25.153   4am-node30   <none>           <none>
fouramf-concurrent-not-found-1-minio-0                            1/1     Running     0              3h7m    10.104.25.154   4am-node30   <none>           <none>
fouramf-concurrent-not-found-1-minio-1                            1/1     Running     0              3h7m    10.104.33.6     4am-node36   <none>           <none>
fouramf-concurrent-not-found-1-minio-2                            1/1     Running     0              3h7m    10.104.20.71    4am-node22   <none>           <none>
fouramf-concurrent-not-found-1-minio-3                            1/1     Running     0              3h7m    10.104.15.34    4am-node20   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-bookie-0                  1/1     Running     0              3h7m    10.104.33.16    4am-node36   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-bookie-1                  1/1     Running     0              3h7m    10.104.32.204   4am-node39   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-bookie-2                  1/1     Running     0              3h7m    10.104.15.35    4am-node20   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-bookie-init-b95r7         0/1     Completed   0              3h7m    10.104.30.141   4am-node38   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-broker-0                  1/1     Running     0              3h7m    10.104.21.84    4am-node24   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-broker-1                  1/1     Running     0              3h7m    10.104.20.65    4am-node22   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-proxy-0                   1/1     Running     0              3h7m    10.104.21.85    4am-node24   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-proxy-1                   1/1     Running     0              3h7m    10.104.9.66     4am-node14   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-pulsar-init-92b6d         0/1     Completed   0              3h7m    10.104.21.83    4am-node24   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-recovery-0                1/1     Running     0              3h7m    10.104.9.65     4am-node14   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-zookeeper-0               1/1     Running     0              3h7m    10.104.32.195   4am-node39   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-zookeeper-1               1/1     Running     0              3h7m    10.104.20.68    4am-node22   <none>           <none>
fouramf-concurrent-not-found-1-pulsarv3-zookeeper-2               1/1     Running     0              3h7m    10.104.15.29    4am-node20   <none>           <none>

{pod=~"fouramf-concurrent-not-found-1-milvus-proxy-7f579d96b6-zt9l8"} |~ "5d26cb436d1ae7dd6dc18cab24c1b7cc"

The requery step took a long time and caused the request to time out.

image

client log:
Screenshot 2024-12-06 14 33 00

Expected Behavior

No response

Steps To Reproduce

concurrent test and calculation of RT and QPS

        :purpose:  `DQL & DML(partition)`
            verify concurrent DQL & DML(partition) scenario,
            which has 4 vector fields(IVF_FLAT, HNSW, DISKANN, IVF_SQ8) and scalar fields: `int64_1`, `varchar_1`

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim,
                'float_vector_1': 128dim,
                'float_vector_2': 128dim,
                'float_vector_3': 128dim,
                scalar field: int64_1, varchar_1
            2. build indexes:
                IVF_FLAT: 'float_vector'
                HNSW: 'float_vector_1',
                DISKANN: 'float_vector_2'
                IVF_SQ8: 'float_vector_3'
                INVERTED: 'int64_1', 'varchar_1'
                default scalar index: 'id'
            3. insert 1 million data into 10 partitions
            4. flush collection
            5. build indexes again using the same params
            6. load collection
                replica: 1
            7. concurrent request:
                - scene_test_partition_hybrid_search
                    (partition: create->insert->flush->index again->load->hybrid_search->release->hybrid_search failed->drop)
                - search
                - hybrid_search
                - query
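The four task types in step 7 are not issued equally often. A minimal sketch, assuming the `weight` values from the `concurrent_tasks` section of the report (1, 8, 8, 1) act as relative selection weights, which is how locust treats task weights; whether the fouram framework maps them exactly this way is an assumption. The function name `expected_share` is hypothetical.

```python
from fractions import Fraction

# Weights copied from the `concurrent_tasks` entries in the report data.
weights = {
    "scene_test_partition_hybrid_search": 1,
    "search": 8,
    "hybrid_search": 8,
    "query": 1,
}


def expected_share(weights):
    """Return each task's expected fraction of picks under weight-proportional selection."""
    total = sum(weights.values())
    return {name: Fraction(w, total) for name, w in weights.items()}


shares = expected_share(weights)
for name, share in shares.items():
    print(f"{name:<40} {float(share):6.1%}")
```

Completed-request counts can drift from these shares because slow tasks (here the partition scene, which chains several DDL and DML calls) occupy a user for much longer than a single search.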

Milvus Log

No response

Anything else?

test result:

[2024-12-06 06:14:20,588 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-12-06 06:14:20,588 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]: grpc     hybrid_search                                                                    188   78(41.49%) | 257724     433  600086  28000 |    0.02        0.01 (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]: grpc     query                                                                             27    9(33.33%) | 219319     207  600280  52000 |    0.00        0.00 (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]: grpc     scene_test_partition_hybrid_search                                                 6   6(100.00%) |1577670  608966 2290840 1061000 |    0.00        0.00 (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]: grpc     search                                                                           204   62(30.39%) | 207997   10877  600084  47000 |    0.02        0.01 (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]:          Aggregated                                                                       425  155(36.47%) | 250050     207 2290840  44000 |    0.04        0.01 (stats.py:789)
[2024-12-06 06:14:20,588 -  INFO - fouram]:  (stats.py:790)
[2024-12-06 06:14:20,591 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_2c8m',
            'config': {'queryNode': {'resources': {'limits': {'cpu': '32.0', 'memory': '32Gi'}, 'requests': {'cpu': '17.0', 'memory': '17Gi'}}, 'replicas': 2},
                       'indexNode': {'resources': {'limits': {'cpu': '8.0', 'memory': '8Gi'}, 'requests': {'cpu': '5.0', 'memory': '5Gi'}}, 'replicas': 4},
                       'dataNode': {'resources': {'limits': {'cpu': '2.0', 'memory': '8Gi'}, 'requests': {'cpu': '2.0', 'memory': '5Gi'}}},
                       'cluster': {'enabled': True},
                       'pulsarv3': {},
                       'kafka': {},
                       'minio': {'metrics': {'podMonitor': {'enabled': True}}},
                       'etcd': {'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20241206-d7a5ad4e-amd64'}}},
            'host': 'fouramf-concurrent-not-found-1-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_hybrid_search_locust_dql_dml_partition_hybrid_search_cluster',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_index': {'id': {}, 'int64_1': {'index_type': 'INVERTED'}, 'varchar_1': {'index_type': 'INVERTED'}},
                                                    'vectors_index': {'float_vector_1': {'index_type': 'HNSW',
                                                                                         'index_param': {'M': 8, 'efConstruction': 200},
                                                                                         'metric_type': 'L2'},
                                                                      'float_vector_2': {'index_type': 'DISKANN', 'index_param': {}, 'metric_type': 'IP'},
                                                                      'float_vector_3': {'index_type': 'IVF_SQ8',
                                                                                         'index_param': {'nlist': 2048},
                                                                                         'metric_type': 'L2'}},
                                                    'scalars_params': {'float_vector_1': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}},
                                                                       'float_vector_2': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}},
                                                                       'float_vector_3': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}}},
                                                    'extra_partitions': {'partitions': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                        'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                        'partition_9'],
                                                                         'data_repeated': False},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 1000000,
                                                    'ni_per': 10000},
                                 'collection_params': {'other_fields': ['float_vector_1', 'float_vector_2', 'float_vector_3', 'int64_1', 'varchar_1'],
                                                       'shards_num': 2},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_FLAT', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 20, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'scene_test_partition_hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 1,
                                                                  'top_k': 1,
                                                                  'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
                                                                           {'search_param': {'ef': 64}, 'anns_field': 'float_vector_1', 'top_k': 10},
                                                                           {'search_param': {'search_list': 32}, 'anns_field': 'float_vector_2', 'top_k': 30},
                                                                           {'search_param': {'nprobe': 16}, 'anns_field': 'float_vector_3', 'top_k': 400}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'hybrid_search_counts': 1,
                                                                  'data_size': 3000,
                                                                  'ni': 3000}},
                                                      {'type': 'search',
                                                       'weight': 8,
                                                       'params': {'nq': 1000,
                                                                  'top_k': 1,
                                                                  'search_param': {'nprobe': 1000},
                                                                  'expr': 'int64_1 >= 0',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'output_fields': None,
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 8,
                                                       'params': {'nq': 1,
                                                                  'top_k': 100,
                                                                  'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
                                                                           {'search_param': {'ef': 64}, 'anns_field': 'float_vector_1', 'top_k': 10},
                                                                           {'search_param': {'search_list': 32}, 'anns_field': 'float_vector_2', 'top_k': 30},
                                                                           {'search_param': {'nprobe': 16}, 'anns_field': 'float_vector_3', 'top_k': 400}],
                                                                  'rerank': {'WeightedRanker': [0.85, 0.95, 0.51, 0.32]},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1 && ',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': None,
                                                                  'ignore_growing': False,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'timeout': 600,
                                                                  'consistency_level': None,
                                                                  'random_data': True,
                                                                  'random_count': 20,
                                                                  'random_range': [0, 100000],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'custom_expr': None,
                                                                  'custom_range': [0, 1],
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}}]},
            'run_id': 2024120644573050,
            'datetime': '2024-12-06 03:07:37.396115',
            'client_version': '2.2'},
 'result': {'test_result': {'index': {'RT': 38.2989,
                                      'float_vector_1': {'RT': 2.0264},
                                      'float_vector_2': {'RT': 6.0587},
                                      'float_vector_3': {'RT': 0.5195},
                                      'id': {'RT': 0.5231},
                                      'int64_1': {'RT': 0.5162},
                                      'varchar_1': {'RT': 0.5166}},
                            'insert': {'total_time': 124.3142, 'VPS': 8047.1507, 'batch_time': 1.2432, 'batch': 10000.0},
                            'flush': {'RT': 3.0397},
                            'load': {'RT': 3.6735},
                            'Locust': {'Aggregated': {'Requests': 425,
                                                      'Fails': 155,
                                                      'RPS': 0.04,
                                                      'fail_s': 0.36,
                                                      'RT_max': 2290840.11,
                                                      'RT_avg': 250050.08,
                                                      'TP50': 44000.0,
                                                      'TP99': 983000.0},
                                       'hybrid_search': {'Requests': 188,
                                                         'Fails': 78,
                                                         'RPS': 0.02,
                                                         'fail_s': 0.41,
                                                         'RT_max': 600086.13,
                                                         'RT_avg': 257724.02,
                                                         'TP50': 28000.0,
                                                         'TP99': 600000.0},
                                       'query': {'Requests': 27,
                                                 'Fails': 9,
                                                 'RPS': 0.0,
                                                 'fail_s': 0.33,
                                                 'RT_max': 600280.24,
                                                 'RT_avg': 219319.4,
                                                 'TP50': 52000.0,
                                                 'TP99': 600000.0},
                                       'scene_test_partition_hybrid_search': {'Requests': 6,
                                                                              'Fails': 6,
                                                                              'RPS': 0.0,
                                                                              'fail_s': 1.0,
                                                                              'RT_max': 2290840.11,
                                                                              'RT_avg': 1577670.37,
                                                                              'TP50': 2237000.0,
                                                                              'TP99': 2291000.0},
                                       'search': {'Requests': 204,
                                                  'Fails': 62,
                                                  'RPS': 0.02,
                                                  'fail_s': 0.3,
                                                  'RT_max': 600084.47,
                                                  'RT_avg': 207997.66,
                                                  'TP50': 47000.0,
                                                  'TP99': 600000.0}}}}}
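
The failure percentages in the locust summary can be re-derived from the request and fail counts, and the maximum response times line up with the configured timeout. A minimal sketch, assuming the `'timeout': 600` task parameter is in seconds and the locust response times are in milliseconds; `summarize` and `TIMEOUT_MS` are hypothetical names.

```python
# (requests, fails, max response time in ms) copied from the locust summary.
stats = {
    "hybrid_search": (188, 78, 600086),
    "query": (27, 9, 600280),
    "scene_test_partition_hybrid_search": (6, 6, 2290840),
    "search": (204, 62, 600084),
}
TIMEOUT_MS = 600 * 1000  # assumption: 'timeout': 600 means 600 seconds


def summarize(stats, timeout_ms):
    """Compute the failure percentage and whether the slowest request reached the timeout."""
    rows = {}
    for name, (reqs, fails, rt_max) in stats.items():
        rows[name] = {
            "fail_pct": round(100 * fails / reqs, 2),
            "hit_timeout": rt_max >= timeout_ms,
        }
    return rows


summary = summarize(stats, TIMEOUT_MS)
total_reqs = sum(reqs for reqs, _, _ in stats.values())
total_fails = sum(fails for _, fails, _ in stats.values())
aggregated_pct = round(100 * total_fails / total_reqs, 2)

for name, row in summary.items():
    print(f"{name:<40} {row['fail_pct']:6.2f}%  hit_timeout={row['hit_timeout']}")
print(f"{'Aggregated':<40} {aggregated_pct:6.2f}%")
```

Every task's slowest request sits at or just above 600000 ms, which is consistent with requests blocking until the client-side timeout fires rather than failing fast.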
@wangting0128 wangting0128 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. test/benchmark benchmark test labels Dec 6, 2024
@wangting0128 wangting0128 added this to the 2.5.0 milestone Dec 6, 2024
@wangting0128
Contributor Author

wangting0128 commented Dec 6, 2024

Different case, same error.

argo task: fouramf-concurrent-not-found
test case name: test_hybrid_search_locust_dql_dml_partition_cluster
image: master-20241206-d7a5ad4e-amd64

server:

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-concurrent-not-found-2-etcd-0                             1/1     Running     0               3h16m   10.104.15.42    4am-node20   <none>           <none>
fouramf-concurrent-not-found-2-etcd-1                             1/1     Running     0               3h16m   10.104.30.165   4am-node38   <none>           <none>
fouramf-concurrent-not-found-2-etcd-2                             1/1     Running     0               3h16m   10.104.24.223   4am-node29   <none>           <none>
fouramf-concurrent-not-found-2-milvus-datanode-86cc7fdff4-82xqr   1/1     Running     3 (3h15m ago)   3h16m   10.104.27.189   4am-node31   <none>           <none>
fouramf-concurrent-not-found-2-milvus-indexnode-7bd46f56859sxz4   1/1     Running     3 (3h15m ago)   3h16m   10.104.16.58    4am-node21   <none>           <none>
fouramf-concurrent-not-found-2-milvus-indexnode-7bd46f5685qv5xl   1/1     Running     3 (3h14m ago)   3h16m   10.104.9.74     4am-node14   <none>           <none>
fouramf-concurrent-not-found-2-milvus-indexnode-7bd46f5685x8bht   1/1     Running     3 (3h15m ago)   3h16m   10.104.14.225   4am-node18   <none>           <none>
fouramf-concurrent-not-found-2-milvus-indexnode-7bd46f5685z9tvk   1/1     Running     4 (3h14m ago)   3h16m   10.104.34.190   4am-node37   <none>           <none>
fouramf-concurrent-not-found-2-milvus-mixcoord-7c7476b5d8-2fwd8   1/1     Running     3 (3h15m ago)   3h16m   10.104.16.57    4am-node21   <none>           <none>
fouramf-concurrent-not-found-2-milvus-proxy-8864d876b-cwrn6       1/1     Running     3 (3h15m ago)   3h16m   10.104.16.56    4am-node21   <none>           <none>
fouramf-concurrent-not-found-2-milvus-querynode-6446df69c76vx5h   1/1     Running     0               3h16m   10.104.27.190   4am-node31   <none>           <none>
fouramf-concurrent-not-found-2-minio-0                            1/1     Running     0               3h16m   10.104.18.33    4am-node25   <none>           <none>
fouramf-concurrent-not-found-2-minio-1                            1/1     Running     0               3h16m   10.104.30.164   4am-node38   <none>           <none>
fouramf-concurrent-not-found-2-minio-2                            1/1     Running     0               3h16m   10.104.32.215   4am-node39   <none>           <none>
fouramf-concurrent-not-found-2-minio-3                            1/1     Running     0               3h16m   10.104.15.45    4am-node20   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-bookie-0                  1/1     Running     0               3h16m   10.104.30.163   4am-node38   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-bookie-1                  1/1     Running     0               3h16m   10.104.24.221   4am-node29   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-bookie-2                  1/1     Running     0               3h16m   10.104.32.216   4am-node39   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-bookie-init-s6z5x         0/1     Completed   0               3h16m   10.104.9.71     4am-node14   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-broker-0                  1/1     Running     0               3h16m   10.104.24.211   4am-node29   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-broker-1                  1/1     Running     0               3h16m   10.104.21.92    4am-node24   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-proxy-0                   1/1     Running     0               3h16m   10.104.18.25    4am-node25   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-proxy-1                   1/1     Running     0               3h16m   10.104.9.73     4am-node14   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-pulsar-init-854mx         0/1     Completed   0               3h16m   10.104.9.72     4am-node14   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-recovery-0                1/1     Running     0               3h16m   10.104.9.70     4am-node14   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-zookeeper-0               1/1     Running     0               3h16m   10.104.30.160   4am-node38   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-zookeeper-1               1/1     Running     0               3h16m   10.104.18.35    4am-node25   <none>           <none>
fouramf-concurrent-not-found-2-pulsarv3-zookeeper-2               1/1     Running     0               3h16m   10.104.15.44    4am-node20   <none>           <none>

client log:
Screenshot 2024-12-06 14 51 26

test steps:

        concurrent test and calculation of RT and QPS

        :purpose:  `DQL & DML(partition)`
            verify concurrent DQL & DML(partition) scenario,
            which has 4 vector fields(IVF_FLAT, HNSW, DISKANN, IVF_SQ8) and scalar fields: `int64_1`, `varchar_1`

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim,
                'float_vector_1': 128dim,
                'float_vector_2': 128dim,
                'float_vector_3': 128dim,
                scalar field: int64_1, varchar_1
            2. build indexes:
                IVF_FLAT: 'float_vector'
                HNSW: 'float_vector_1',
                DISKANN: 'float_vector_2'
                IVF_SQ8: 'float_vector_3'
                INVERTED: 'int64_1', 'varchar_1'
                default scalar index: 'id'
            3. insert 1 million data into 10 partitions
            4. flush collection
            5. build indexes again using the same params
            6. load collection
                replica: 1
            7. concurrent request:
                - scene_test_partition
                    (partition: create->insert->flush->index again->load->search->release->search failed->drop)
                - search
                - hybrid_search
                - query

@wangting0128
Contributor Author

Test cases that hit the same error:

  • test_hybrid_search_locust_dql_dml_partition_hybrid_search_cluster
  • test_hybrid_search_locust_dql_dml_partition_cluster
  • test_bitmap_locust_dql_dml_partitions_cluster
  • test_bitmap_locust_dml_partitions_standalone
  • test_bitmap_locust_hybrid_index_cluster

congqixia added a commit to congqixia/milvus that referenced this issue Dec 6, 2024
Related to milvus-io#38275

Make rootcoord's describe collection execute without the scheduler lock in
order to remove the deadlock introduced between sync partition and load
segment (describe collection)

Signed-off-by: Congqi Xia <[email protected]>
@congqixia
Contributor

This issue was caused by a logic deadlock between CreatePartition (SyncPartition to QueryCoord) and segment loading (describe collection).

Trying to solve this problem by moving describe collection out of the lock.
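
One way to picture the deadlock is as a cycle in a wait-for graph. The sketch below is a simplified reading of this thread, not the actual rootcoord or QueryCoord code: the node names and the exact edges are assumptions, and `find_cycle` is a hypothetical helper.

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    visiting, done = [], set()

    def dfs(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: everything from the first visit of `node` forms the cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in wait_for.get(node, ()):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in wait_for:
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


# Assumed edges: "A": ["B"] means A cannot finish until B finishes.
with_lock = {
    "CreatePartition": ["SyncPartition"],      # runs while holding the DDL lock
    "SyncPartition": ["SegmentLoad"],          # QueryCoord is busy with the load
    "SegmentLoad": ["DescribeCollection"],     # the load needs collection info
    "DescribeCollection": ["CreatePartition"], # queued behind the DDL lock
}

# Proposed change: describe collection no longer waits for the lock.
without_lock = dict(with_lock, DescribeCollection=[])

print(find_cycle(with_lock))
print(find_cycle(without_lock))
```

Removing any single edge of the loop breaks the cycle; which edge can safely be removed is a separate correctness question for the real system.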

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2024
@yanliang567 yanliang567 removed their assignment Dec 7, 2024
@xiaofan-luan
Collaborator

> this issue was caused by logic deadlock of CreatePartition(SyncPartition to QueryCoord) and load segmenting (describe collection)
>
> trying to solve this problem by move describe collection out of lock

The fix does not seem to work very well.

Right now, to keep the proxy cache consistent, every describe collection request from the proxy needs to wait for the DDL lock; you cannot skip it.

Could you explain why the deadlock happened?

congqixia added a commit to congqixia/milvus that referenced this issue Dec 9, 2024
Related to milvus-io#38275

This PR moves the sync created partition step to the proxy to avoid a potential
logic deadlock when create partition happens during a target segment change.

Signed-off-by: Congqi Xia <[email protected]>
@congqixia
Contributor

sync_create_part_proxy
After some offline discussion, describe collection cannot be moved out of the scheduler right now.
The current solution is to move the SyncCreatedPartition step to the proxy to break this dependency loop, as shown in the graph above.
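
The effect of moving the sync step can be reproduced with a toy model. This is a sketch under stated assumptions, not Milvus code: two `threading.Lock` objects stand in for the rootcoord DDL scheduler lock and for QueryCoord being busy with a segment load, and timeouts replace the real hang so the script terminates. All names (`simulate`, `sync_under_ddl_lock`, and so on) are hypothetical.

```python
import threading


def simulate(sync_under_ddl_lock, timeout=0.5):
    """Run a toy CreatePartition against a toy segment load.

    Returns a dict of step name -> bool; False means the step gave up
    after `timeout` seconds instead of waiting forever.
    """
    ddl_lock = threading.Lock()    # stand-in for the DDL scheduler lock
    querycoord = threading.Lock()  # stand-in for QueryCoord busy with a load
    load_running = threading.Event()
    ddl_taken = threading.Event()
    result = {}

    def sync_partition():
        ok = querycoord.acquire(timeout=timeout)
        if ok:
            querycoord.release()
        return ok

    def segment_load():
        with querycoord:
            load_running.set()
            ddl_taken.wait(timeout)
            # Describe collection has to queue behind the DDL lock.
            ok = ddl_lock.acquire(timeout=timeout)
            if ok:
                ddl_lock.release()
            result["segment_load"] = ok

    def create_partition():
        load_running.wait(timeout)
        with ddl_lock:
            ddl_taken.set()
            if sync_under_ddl_lock:
                result["create_partition"] = sync_partition()
        if not sync_under_ddl_lock:
            # Reordered flow: sync only after the DDL lock is released.
            result["create_partition"] = sync_partition()

    threads = [
        threading.Thread(target=segment_load),
        threading.Thread(target=create_partition),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


before = simulate(sync_under_ddl_lock=True)
after = simulate(sync_under_ddl_lock=False)
print("sync under DDL lock:", before)
print("sync after DDL lock:", after)
```

With the sync step under the lock, at least one of the two steps times out on every run; with the sync step moved after the lock is released, both complete. In the real cluster there is no short timeout on these waits, so the stall would plausibly surface as the 600 s client timeouts reported above.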

sre-ci-robot pushed a commit that referenced this issue Dec 10, 2024
Related to #38275

This PR moves the sync created partition step to the proxy to avoid a potential
logic deadlock when create partition happens during a target segment change.

Signed-off-by: Congqi Xia <[email protected]>
@wangting0128
Contributor Author

Verification passed.

argo task: fouramf-bitmap-scenes-tw86w
image: master-20241210-7ea9c983-amd64

argo task: fouramf-concurrent-xcq
image: master-20241210-fec31fed-amd64
