[Bug]: The number of query count * is larger than the number of inserted entities during the major compaction #38373

binbinlv · 2024-12-11T08:52:23Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: 2.4-20241211-f4696a19-amd64
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka):    all 
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

During the major compaction, query count(*) returns more entities than were inserted.

The result goes from 1000000 to 2000000.

Expected Behavior

The result of query count(*) should equal the number of inserted entities during the major compaction.

Steps To Reproduce

Prepare data and run the major compaction:

import os
import time
import random
import string
import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

fmt = "\n=== {:30} ===\n"
dim = 128

print(fmt.format("start connecting to Milvus"))
host = os.environ.get('MILVUS_HOST')
if host is None:
    host = ""
print(fmt.format(f"Milvus host: {host}"))
connections.connect()


default_fields = [
    FieldSchema(name="count", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="key", dtype=DataType.INT64, is_clustering_key=True),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="var", dtype=DataType.VARCHAR, max_length=10000, is_primary=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim)
]
default_schema = CollectionSchema(fields=default_fields, description="test clustering-key collection")
collection_name = "major_compaction_collection_enable_scalar_clustering_key"

if utility.has_collection(collection_name):
    collection = Collection(name=collection_name)
    collection.drop()
    print("drop the original collection")
hello_milvus = Collection(name=collection_name, schema=default_schema)

nb = 1000
rng = np.random.default_rng(seed=19530)
random_data = rng.random(nb).tolist()
vec_data = [[random.random() for _ in range(dim)] for _ in range(nb)]
_len = int(20)
_str = string.ascii_letters + string.digits
_s = _str
print("_str size ", len(_str))

for i in range(int(_len / len(_str))):
    _s += _str
    print("append str ", i)
values = [''.join(random.sample(_s, _len - 1)) for _ in range(nb)]
batch = 0
while batch < 1000:
    # insert one batch of nb rows; primary keys are unique across batches
    data = [
        [batch * nb + i for i in range(nb)],
        [random.randint(0, 1000) for i in range(nb)],
        random_data,
        values,
        vec_data,
    ]
    start = time.time()
    res = hello_milvus.insert(data)
    end = time.time() - start
    print("insert %d %d done in %f" % (batch, nb, end))
    batch += 1
    # flush after every batch, so 1000 * nb = 1000000 entities are persisted in total
    hello_milvus.flush()

print(f"Number of entities in Milvus: {hello_milvus.num_entities}")  # check num_entities

# create index
print(fmt.format("Start Creating index IVF_FLAT"))
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 100},
}

hello_milvus.create_index("embeddings", index_params)

hello_milvus.load()

print("Start major compaction")
hello_milvus.compact(is_clustering=True)

res = hello_milvus.get_compaction_state(is_clustering=True)
print(res)

print("waiting for compaction completed")
hello_milvus.wait_for_compaction_completed(is_clustering=True)


res = hello_milvus.get_compaction_state(is_clustering=True)
print(res)

Query count(*) while the major compaction is running:

import os
import time
from pymilvus import connections, Collection

fmt = "\n=== {:30} ===\n"
dim = 128


print(fmt.format("start connecting to Milvus"))
host = os.environ.get('MILVUS_HOST')
if host is None:
    host = ""
print(fmt.format(f"Milvus host: {host}"))
connections.connect()


collection_name = "major_compaction_collection_enable_scalar_clustering_key"


hello_milvus = Collection(name=collection_name)

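expected = 1000000  # 1000 batches x 1000 rows inserted by the first script
duration = 0

# poll count(*) once per second; the assert fails as soon as the count deviates
while duration >= 0:
    res = hello_milvus.query("count>=0", output_fields=["count(*)"])
    print(res[0]['count(*)'])
    assert res[0]['count(*)'] == expected
    duration = duration + 1
    time.sleep(1)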
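
The polling loop stops at the first mismatch, which hides how long the over-count lasts. Below is a minimal sketch of a bounded variant that records every sample and reports the mismatching ones afterwards. The helper names `poll_counts` and `mismatches` are hypothetical, and `fetch_count` is assumed to be any zero-argument callable returning the current count (for example, a lambda wrapping the `hello_milvus.query(...)` call from the script). The values in the demo are simulated, not captured output.

```python
import time
from typing import Callable, List, Tuple


def poll_counts(fetch_count: Callable[[], int], samples: int,
                interval_s: float = 1.0) -> List[int]:
    """Call fetch_count `samples` times, sleeping interval_s between calls."""
    observed = []
    for i in range(samples):
        observed.append(fetch_count())
        if i < samples - 1:
            time.sleep(interval_s)
    return observed


def mismatches(observed: List[int], expected: int) -> List[Tuple[int, int]]:
    """Return (sample_index, count) for every sample that differs from expected."""
    return [(i, c) for i, c in enumerate(observed) if c != expected]


if __name__ == "__main__":
    # Stand-in for the real query: simulated counts only (hypothetical data).
    fake = iter([1000000, 2000000, 2000000, 1000000])
    observed = poll_counts(lambda: next(fake), samples=4, interval_s=0.0)
    print(mismatches(observed, expected=1000000))
```

Against a real deployment, `fetch_count` could be `lambda: hello_milvus.query("count>=0", output_fields=["count(*)"])[0]["count(*)"]`, reusing the query from the script.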

Milvus Log

No response

Anything else?

No response

binbinlv · 2024-12-11T08:52:34Z

/assign @xiaocai2333

binbinlv · 2024-12-11T08:52:42Z

/unassign @yanliang567

binbinlv · 2024-12-11T09:08:05Z

Also, some time after the major compaction finishes, query count(*) returns to the expected number.

xiaocai2333 · 2024-12-11T09:41:51Z

This is because the segments generated by clustering compaction did not have the correct compactionFrom set, so GetRecoveryInfo retrieved duplicate data.

issue: #38373, master pr: #36799. This bug was introduced by PR #37653. Signed-off-by: Cai Zhang <[email protected]>
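
To illustrate why a missing compactionFrom can double the count, here is a toy model. It is not Milvus's GetRecoveryInfo implementation; `Segment`, `compaction_from`, and `visible_rows` are hypothetical names made up for this sketch. It assumes only that a reader uses the lineage recorded on a compacted segment to skip the source segments it replaced. If the lineage is empty, both the sources and the result are counted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    seg_id: int
    num_rows: int
    # IDs of the segments this one was compacted from (empty for fresh segments)
    compaction_from: List[int] = field(default_factory=list)


def visible_rows(segments: List[Segment]) -> int:
    """Sum rows, skipping segments that another listed segment was compacted from."""
    replaced = {src for seg in segments for src in seg.compaction_from}
    return sum(seg.num_rows for seg in segments if seg.seg_id not in replaced)


if __name__ == "__main__":
    sources = [Segment(1, 500), Segment(2, 500)]
    # Lineage recorded: the sources are skipped, so only the result is counted.
    with_lineage = sources + [Segment(3, 1000, compaction_from=[1, 2])]
    # Lineage missing: sources and result are all counted, doubling the total.
    without_lineage = sources + [Segment(3, 1000)]
    print(visible_rows(with_lineage), visible_rows(without_lineage))
```

In this toy setup the same 1000 rows are counted once with lineage and twice without it, which mirrors the 1000000 -> 2000000 symptom reported above.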

binbinlv · 2024-12-12T04:20:23Z

Verified and fixed.

The result of query count(*) remains unchanged before, during, and after the major compaction.

milvus: 2.4-20241212-e9598604-amd64
pymilvus: 2.5.0

binbinlv added the kind/bug and needs-triage labels Dec 11, 2024

binbinlv assigned yanliang567 Dec 11, 2024

sre-ci-robot assigned xiaocai2333 Dec 11, 2024

sre-ci-robot unassigned yanliang567 Dec 11, 2024

binbinlv added the triage/accepted label and removed the needs-triage label Dec 11, 2024

binbinlv added this to the 2.4.18 milestone Dec 11, 2024

xiaocai2333 mentioned this issue Dec 11, 2024

fix:[2.4]Set the correct compactionFroms for clustering segments #38376

Merged

sre-ci-robot pushed a commit that referenced this issue Dec 11, 2024

fix:[2.4]Set the correct compactionFroms for clustering segments (#38376) dde9d6c

issue: #38373, master pr: #36799. This bug was introduced by PR #37653. Signed-off-by: Cai Zhang <[email protected]>

binbinlv closed this as completed Dec 12, 2024
