[Bug]: limit=10,but only 3 record is returned #38307

Closed
1 task done
Royhuiy opened this issue Dec 9, 2024 · 18 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments


Royhuiy commented Dec 9, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.4.17
- Deployment mode(standalone or cluster):standalone 
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.5.0
- OS(Ubuntu or CentOS): ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I embed txt/docx/pdf documents with bge-base-zh-v1.5 and insert the embeddings into Milvus. The collection is built as follows:
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

document_fields = [
    FieldSchema(name="paragraph", dtype=DataType.INT64),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2096),
    FieldSchema(name="text_type", dtype=DataType.VARCHAR, max_length=2096),
    FieldSchema(name="document_path", dtype=DataType.VARCHAR, is_primary=True, max_length=2096),
]

schema = CollectionSchema(document_fields)
collection = Collection(collection_name, schema)

# Create an index to speed up searches

index_params = {"metric_type": "COSINE", "index_type": "IVF_FLAT", "params": {"nlist": 1024}}
collection.create_index("embedding", index_params)

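Fields: paragraph is the paragraph number, embedding is the vector, text is the text content, text_type is the file type (txt/docx/pdf), and document_path is the path of the document.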

The search script:
def search_document_by_text(self, query_text: str = "", top_k: int = 10):
    try:
        collection = self.milvus.create_collection(db_name=self.db_name,
                                                   if_exist=True, collection_name="document_collection")
        # Encode the query text into a vector with the embedding model
        text_features = self.embed_model.encode(query_text, convert_to_tensor=True).to(self.device)
        query_vector = text_features.cpu().numpy()
        print(query_vector)
        search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
        results = collection.search([query_vector], "embedding", search_params, limit=top_k,
                                    output_fields=["paragraph", "text", "document_path"])

        print("length of search results:", len(results[0]))

        return results
    except Exception as e:
        print(f"Document search error: {e}")
        return []

limit=10, but only 3 records are returned.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@Royhuiy Royhuiy added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 9, 2024
@yanliang567
Contributor

@Royhuiy how many entities did you insert into Milvus? Please refer to this doc to export the complete Milvus logs for investigation.

/assign @Royhuiy
/unassign

@sre-ci-robot sre-ci-robot assigned Royhuiy and unassigned yanliang567 Dec 9, 2024
@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 9, 2024
Contributor

github-actions bot commented Dec 9, 2024

The title and description of this issue contain Chinese. Please use English to describe your issue.

Author

Royhuiy commented Dec 9, 2024

image

@xiaofan-luan xiaofan-luan changed the title [Bug]: limit=10,但只返回3条记录 [Bug]: limit=10,but only 3 record is returned Dec 9, 2024
@xiaofan-luan
Collaborator

  1. Do you use any filters when you query?
  2. Is there any duplicated data in your field? Please show some data samples if possible.

Author

Royhuiy commented Dec 10, 2024

Please download the documents (data) from this link: https://pan.quark.cn/s/391003bb76af

Contributor

cydrain commented Dec 10, 2024

Hi @Royhuiy ,

You set limit = 10 but only get 3 results. The most likely reason is that there are too few entities in each bucket on average.

I see your total entity count is 1234, and you search with 'nlist = 1024, nprobe = 10'.
I suggest trying 'nlist = 1024, nprobe = 128', or 'nlist = 32, nprobe = 4'.
If Milvus still returns fewer than 10 results, please let me know.
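
To put rough numbers on the suggestion above, here is a small back-of-the-envelope sketch in plain Python (no Milvus calls). It assumes entities are spread evenly across the IVF buckets, which real data will not do exactly, so treat the result as an estimate only. The helper name `expected_candidates` is made up for this illustration.

```python
def expected_candidates(total_entities: int, nlist: int, nprobe: int) -> float:
    """Estimate how many entities an IVF search scans.

    Assumption: entities are distributed evenly over the nlist buckets.
    """
    probed = min(nprobe, nlist)  # cannot probe more buckets than exist
    return total_entities / nlist * probed


total = 1234  # entity count mentioned in this thread
for nlist, nprobe in [(1024, 10), (1024, 128), (32, 4)]:
    estimate = expected_candidates(total, nlist, nprobe)
    print(f"nlist={nlist}, nprobe={nprobe}: ~{estimate:.1f} candidates")
```

With the original settings the estimate is only about 12 candidates, and uneven or empty buckets can push the real figure below the limit of 10. Both suggested settings raise the estimate to roughly 154.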

Contributor

cydrain commented Dec 10, 2024

/assign

Author

Royhuiy commented Dec 10, 2024

I set nlist = 32, nprobe = 4, limit = 10, but only 2 records are returned.
Could you show me code for how to split and embed the documents?
Thanks~

Contributor

cydrain commented Dec 10, 2024

Hi @Royhuiy ,

'nlist' is a parameter used when building the IVF_FLAT index, so if you change nlist to 32 you need to rebuild your index.
Since your previous nlist is 1024, can you try setting 'nprobe = 128' and search again?

Also, I don't understand what you mean by "could you show me the code how to split and embedding".
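
As a rough sketch of what rebuilding the index could look like with the pymilvus ORM `Collection` object used earlier in this issue: release the collection, drop the old index, create the new one, and load again. The helper name `rebuild_ivf_flat_index` and its default arguments are invented for this example, and it has not been run against this particular deployment.

```python
def rebuild_ivf_flat_index(collection, nlist, field_name="embedding", metric_type="COSINE"):
    """Rebuild an IVF_FLAT index with a new nlist.

    `collection` is assumed to be a pymilvus ORM Collection; only
    release/drop_index/create_index/load are called on it.
    """
    index_params = {
        "metric_type": metric_type,
        "index_type": "IVF_FLAT",
        "params": {"nlist": nlist},
    }
    collection.release()     # the index cannot be dropped while the collection is loaded
    collection.drop_index()  # remove the index built with the old nlist
    collection.create_index(field_name, index_params)
    collection.load()        # load again so the collection can be searched
    return index_params
```

For example, `rebuild_ivf_flat_index(collection, nlist=32)` followed by a search with `{"metric_type": "COSINE", "params": {"nprobe": 4}}` would match the second suggestion above.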

Author

Royhuiy commented Dec 10, 2024

1. If I set nlist = 32, which index type should I use?
2. I have tried nlist = 1024 and nprobe = 128, but fewer than 10 results are still returned.

Author

Royhuiy commented Dec 10, 2024

The new index has been set:
image

Contributor

cydrain commented Dec 10, 2024

Hi @Royhuiy ,

I see "nlist = 32" now. If you search with "nprobe = 8/16/32", do you get a different result count?

Author

Royhuiy commented Dec 10, 2024

When I set query = "猫" ("cat"), 3 results are always returned for nprobe = 8/16/32.
Could the reason be that the other scores do not reach a score threshold and are therefore not returned?

Contributor

cydrain commented Dec 10, 2024

Hi @Royhuiy ,

Do you use "document_path" as the primary key?
If so, this is the problem.

You can use "auto_id" instead; see this doc: "https://milvus.io/docs/primary-field.md"

Author

Royhuiy commented Dec 10, 2024

You are right. The primary key is duplicated.

Thanks for your help~
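
For readers wondering why a duplicated primary key shrinks the result set: as far as I understand, Milvus keeps at most one hit per primary key in its search results, so paragraphs that share the same `document_path` collapse into a single hit. The pure-Python sketch below only imitates that behaviour on made-up data. It is not Milvus code, and the paths and scores are invented.

```python
def top_k_unique_by_key(hits, k):
    """Keep the best-scoring hit per primary key, then return the top k.

    `hits` is a list of (primary_key, score) pairs; a higher score is better,
    as with the COSINE metric.
    """
    best = {}
    for key, score in hits:
        if key not in best or score > best[key]:
            best[key] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up example: 12 paragraphs, but only 3 distinct document paths.
hits = [(f"/docs/file_{i % 3}.txt", 0.9 - 0.01 * i) for i in range(12)]
results = top_k_unique_by_key(hits, k=10)
print(len(results))  # 3
```

Even with k=10, only as many results come back as there are distinct keys, which matches the symptom reported in this issue.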

Contributor

cydrain commented Dec 10, 2024

You're welcome. Feel free to keep using Milvus and raise more issues :P

Contributor

cydrain commented Dec 10, 2024

/close

@sre-ci-robot
Contributor

@cydrain: Closing this issue.


5 participants