[Bug]: The number of query count * is larger than the number of inserted entities during the major compaction #38373

Closed
1 task done
binbinlv opened this issue Dec 11, 2024 · 5 comments
Labels: kind/bug, triage/accepted

binbinlv commented Dec 11, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20241211-f4696a19-amd64
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka): all
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

While major compaction is running, a count(*) query returns more entities than were inserted.

1000000 inserted -> 2000000 returned

Expected Behavior

The count(*) query result should equal the number of inserted entities throughout the major compaction.
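
The expected value of 1000000 follows from the insert loop in the reproduction script: 1000 batches of nb = 1000 rows, with primary keys computed as index * nb + i. A minimal sketch of that arithmetic, assuming nothing beyond those two constants (the helper name batch_keys is hypothetical and not part of the script):

```python
nb = 1000        # rows per insert call, as in the reproduction script
batches = 1000   # iterations of the insert loop


def batch_keys(index: int, nb: int) -> range:
    """Primary keys generated for one batch: index * nb + i for i in range(nb)."""
    return range(index * nb, (index + 1) * nb)


# Total number of rows the script inserts.
expected_total = nb * batches

# Consecutive batches cover adjacent key ranges that do not overlap,
# so every inserted primary key is distinct.
first_batch = batch_keys(0, nb)
second_batch = batch_keys(1, nb)
last_batch = batch_keys(batches - 1, nb)

print(expected_total)
```

Since the keys are all distinct, a count above this total cannot be explained by the inserted data itself.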

Steps To Reproduce

  1. Prepare data and run major compaction
import os
import time
import random
import string
import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

fmt = "\n=== {:30} ===\n"
dim = 128

print(fmt.format("start connecting to Milvus"))
# Use MILVUS_HOST when set; otherwise fall back to a local instance on the default port.
host = os.environ.get('MILVUS_HOST', 'localhost')
print(fmt.format(f"Milvus host: {host}"))
connections.connect(host=host, port="19530")


default_fields = [
    FieldSchema(name="count", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="key", dtype=DataType.INT64, is_clustering_key=True),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="var", dtype=DataType.VARCHAR, max_length=10000, is_primary=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim)
]
default_schema = CollectionSchema(fields=default_fields, description="test clustering-key collection")
collection_name = "major_compaction_collection_enable_scalar_clustering_key"

if utility.has_collection(collection_name):
    collection = Collection(name=collection_name)
    collection.drop()
    print("drop the original collection")
hello_milvus = Collection(name=collection_name, schema=default_schema)

nb = 1000
rng = np.random.default_rng(seed=19530)
random_data = rng.random(nb).tolist()
vec_data = [[random.random() for _ in range(dim)] for _ in range(nb)]
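_len = 20
_str = string.ascii_letters + string.digits
_s = _str
print("_str size ", len(_str))

# extend the character pool when _len is at least the alphabet size (a no-op for _len = 20)
for i in range(_len // len(_str)):
    _s += _str
    print("append str ", i)
# each value is _len - 1 characters drawn from the pool without replacement
values = [''.join(random.sample(_s, _len - 1)) for _ in range(nb)]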
index = 0
while index < 1000:
    # insert data
    data = [
        [index * nb + i for i in range(nb)],
        [random.randint(0,1000) for i in range(nb)],
        random_data,
        values,
        vec_data,
    ]
    start = time.time()
    res = hello_milvus.insert(data)
    end = time.time() - start
    print("insert %d %d done in %f" % (index, nb, end))
    index += 1
    hello_milvus.flush()

print(f"Number of entities in Milvus: {hello_milvus.num_entities}")  # check the num_entities

# create index
print(fmt.format("Start Creating index IVF_FLAT"))
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 100},
}

hello_milvus.create_index("embeddings", index)

hello_milvus.load()

print("Start major compaction")
hello_milvus.compact(is_clustering=True)

res = hello_milvus.get_compaction_state(is_clustering=True)
print(res)

print("waiting for compaction completed")
hello_milvus.wait_for_compaction_completed(is_clustering=True)


res = hello_milvus.get_compaction_state(is_clustering=True)
print(res)
  2. Query count(*) in a second process while the compaction is running
import os
import time
import random
import string
import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

fmt = "\n=== {:30} ===\n"
dim = 128


print(fmt.format("start connecting to Milvus"))
# Use MILVUS_HOST when set; otherwise fall back to a local instance on the default port.
host = os.environ.get('MILVUS_HOST', 'localhost')
print(fmt.format(f"Milvus host: {host}"))
connections.connect(host=host, port="19530")


collection_name = "major_compaction_collection_enable_scalar_clustering_key"


hello_milvus = Collection(name=collection_name)

duration = 0

# Poll count(*) once per second; the assertion fails as soon as the count deviates.
while duration >= 0:
    res = hello_milvus.query("count>=0", output_fields=["count(*)"])
    print(res[0]['count(*)'])
    assert res[0]['count(*)'] == 1000000
    duration = duration + 1
    time.sleep(1)
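
The polling loop above stops at the first bad sample, so it does not show how long the inflated count persists. A bounded variant that records every mismatch might look like the sketch below. The function watch_count and its parameters are hypothetical names, and the count source is passed in as a callable so the logic can be tried without a running Milvus instance.

```python
import time
from typing import Callable, List, Tuple


def watch_count(
    query_count: Callable[[], int],
    expected: int,
    samples: int,
    interval_s: float = 1.0,
) -> List[Tuple[int, int]]:
    """Poll query_count() `samples` times and return (sample_index, observed)
    for every sample whose value differs from `expected`."""
    mismatches = []
    for i in range(samples):
        observed = query_count()
        if observed != expected:
            mismatches.append((i, observed))
        if i + 1 < samples:
            time.sleep(interval_s)
    return mismatches


# Stand-in for the real query: a fixed sequence shaped like the report
# (correct, doubled for a while, then correct again). These values are
# illustrative, not measured data.
fake_counts = iter([1000000, 2000000, 2000000, 1000000])
result = watch_count(lambda: next(fake_counts), expected=1000000, samples=4, interval_s=0)
print(result)
```

Against a real collection, the callable could wrap the query from the script above, for example lambda: hello_milvus.query("count>=0", output_fields=["count(*)"])[0]["count(*)"].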

Milvus Log

No response

Anything else?

No response

binbinlv added the kind/bug and needs-triage labels Dec 11, 2024
@binbinlv (Contributor, Author)

/assign @xiaocai2333

@binbinlv (Contributor, Author)

/unassign @yanliang567

binbinlv added the triage/accepted label and removed the needs-triage label Dec 11, 2024
binbinlv added this to the 2.4.18 milestone Dec 11, 2024

binbinlv commented Dec 11, 2024

After the major compaction finishes, the count(*) result returns to the expected number after a period of time.

@xiaocai2333 (Contributor)

This happens because the segments generated by clustering compaction did not have the correct compactionFrom set, so GetRecoveryInfo retrieved duplicate data.
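
To illustrate why missing lineage can double the count, here is a toy model in Python. It is a sketch under stated assumptions, not Milvus code: the Segment class, the visible_segments rule, and the segment sizes are all hypothetical, and the real GetRecoveryInfo logic is more involved. The only idea taken from the explanation above is that a compacted segment must name its source segments for a reader to avoid loading both.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    seg_id: int
    num_rows: int
    # IDs of the segments this one was compacted from (empty if none recorded).
    compaction_from: List[int] = field(default_factory=list)


def visible_segments(segments: List[Segment]) -> List[Segment]:
    """Toy rule: hide any segment that another segment lists as a compaction source."""
    replaced = {src for seg in segments for src in seg.compaction_from}
    return [seg for seg in segments if seg.seg_id not in replaced]


def count_rows(segments: List[Segment]) -> int:
    """Row count a reader would see under the toy visibility rule."""
    return sum(seg.num_rows for seg in visible_segments(segments))


# Two hypothetical source segments holding all inserted rows.
sources = [Segment(1, 500_000), Segment(2, 500_000)]

# Compaction output without lineage: sources and result are both visible.
without_lineage = sources + [Segment(3, 1_000_000)]

# Compaction output with lineage: the sources are hidden by the result.
with_lineage = sources + [Segment(3, 1_000_000, compaction_from=[1, 2])]

print(count_rows(without_lineage), count_rows(with_lineage))
```

In this model the count is doubled exactly as long as the compacted segment and its sources are visible together, which is consistent with the count returning to normal once the old segments are dropped.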

sre-ci-robot pushed a commit that referenced this issue Dec 11, 2024
issue: #38373
master pr: #36799
This bug was introduced by PR #37653.

Signed-off-by: Cai Zhang <[email protected]>

@binbinlv (Contributor, Author)

Verified and fixed.

The count(*) query result stays unchanged before, during, and after the major compaction.

milvus: 2.4-20241212-e9598604-amd64
pymilvus: 2.5.0
