[Bug]: [batch_insert fail] milvus down after I insert 10M random vector #38618

Closed
1 task done
xwt1 opened this issue Dec 20, 2024 · 4 comments
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.


@xwt1

xwt1 commented Dec 20, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.17 or 2.5.x
- Deployment mode(standalone or cluster): standalone with docker compose
- MQ type(rocksmq, pulsar or kafka):    rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.1 
- OS(Ubuntu or CentOS): Ubuntu 22.04.4 LTS
- CPU/Memory: Intel Xeon E5-2678 v3 CPU (2.50 GHz) with 128 GB memory 
- GPU: 
- Others:

Current Behavior

When I insert 10M random vectors into standalone Milvus and then build an index, the instance randomly breaks down.

I basically generate 10M random vectors, insert them into a collection in batches, and then index them. But when I execute the code, the standalone instance goes down.

The code I ran is roughly:

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
import numpy as np
import time
import sys

# Connect to the standalone Milvus instance.
connections.connect(alias="default", host="127.0.0.1", port="19530")

# Generate 10M random 128-dim float vectors and sequential int64 IDs.
num_vectors = 10000000
vector_dim = 128
vectors = np.random.random((num_vectors, vector_dim)).astype(np.float32)
ids = [i for i in range(num_vectors)]

# Schema: an int64 primary key plus a float vector field.
id_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False)
vector_field = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=vector_dim)
schema = CollectionSchema(
    fields=[id_field, vector_field],
    description="Collection with HNSW index variations"
)

# Four HNSW configurations to compare.
hnsw_configs = [
    {"index_type": "HNSW", "params": {"M": 8, "efConstruction": 100}, "metric_type": "L2"},
    {"index_type": "HNSW", "params": {"M": 16, "efConstruction": 200}, "metric_type": "L2"},
    {"index_type": "HNSW", "params": {"M": 32, "efConstruction": 500}, "metric_type": "L2"},
    {"index_type": "HNSW", "params": {"M": 48, "efConstruction": 600}, "metric_type": "L2"}
]

for i, index_config in enumerate(hnsw_configs):
    collection_name = f"hnsw_dataset_{i}"

    collection = Collection(name=collection_name, schema=schema)

    # Insert the 10M vectors in batches of 10k.
    batch_size = 10000
    for start_idx in range(0, len(ids), batch_size):
        end_idx = min(start_idx + batch_size, len(ids))
        batch_ids = ids[start_idx:end_idx]
        batch_vectors = vectors[start_idx:end_idx]
        collection.insert([batch_ids, batch_vectors])
        sys.stdout.flush()

    # Build the HNSW index and time it.
    sys.stdout.flush()
    start_time = time.time()
    collection.create_index(field_name="embedding", index_params=index_config, index_name=f"index_{i}")
    end_time = time.time()
    index_time = end_time - start_time
    sys.stdout.flush()

    collection.flush()

    collection.release()

I don't know whether the reason is that I did not allocate enough resources to etcd. I checked milvus.log (in error mode) and saw messages like "failed to save by batch" and "error="etcdserver: request timed out"". If that is the cause, how can I fix it?

Expected Behavior

No Breakdown

Steps To Reproduce

Milvus Log

standalone-0.txt

Anything else?

No response

@xwt1 xwt1 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 20, 2024
@yanliang567
Contributor

@xwt1 please check:

  1. whether the etcd service is running on an SSD volume, for high performance
  2. how many CPU cores you set for the etcd and milvus pods
  3. please upload the etcd logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @xwt1
/unassign

@sre-ci-robot sre-ci-robot assigned xwt1 and unassigned yanliang567 Dec 20, 2024
@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 20, 2024
@xwt1
Author

xwt1 commented Dec 20, 2024

@xwt1 please check:

  1. whether the etcd service is running on an SSD volume, for high performance
  2. how many CPU cores you set for the etcd and milvus pods
  3. please upload the etcd logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @xwt1 /unassign

@yanliang567
Thanks for your reply!

  1. For point one, I actually run Milvus entirely on an HDD. Does etcd have to run on an SSD, or will it just not work on an HDD? When I shrink the vector count to 1M or fewer, my code works. (See the fsync-latency sketch below.)
  2. For point two, I don't know how to set different CPU core counts in docker-compose.yml or milvus.yaml, so I think it is at the defaults (I don't know the default value for etcd and I can't find the information in the Milvus docs -_-).
  3. Here is the full log for etcd:
    full_log_include_etcd.txt

BTW, I don't really care about performance for now, so running on an HDD with slow performance is acceptable. :(
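
For reference, a minimal sketch (not part of the original comment) of the kind of disk check point 1 is getting at: it measures fsync latency on the host directory backing the etcd volume. The path below is hypothetical; substitute whatever directory is mounted at /etcd in docker-compose.yml. etcd's hardware guidance expects WAL fsync latency of only a few milliseconds, which an HDD usually cannot sustain.

import os
import time

DATA_DIR = "./volumes/etcd"  # hypothetical: the host path mounted at /etcd

def worst_fsync_latency_ms(path: str, rounds: int = 100) -> float:
    """Write small records and fsync each one, returning the worst latency in ms."""
    probe = os.path.join(path, "fsync_probe.tmp")
    worst = 0.0
    with open(probe, "wb") as f:
        for _ in range(rounds):
            f.write(os.urandom(4096))   # small write, roughly a WAL-entry-sized record
            start = time.perf_counter()
            f.flush()
            os.fsync(f.fileno())        # force the data onto stable storage
            worst = max(worst, (time.perf_counter() - start) * 1000)
    os.remove(probe)
    return worst

print(f"worst fsync latency: {worst_fsync_latency_ms(DATA_DIR):.1f} ms")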

@xwt1
Author

xwt1 commented Dec 24, 2024

@yanliang567

version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - /docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/etcd_log:/docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/etcd_log
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 110G  
          cpus: '16.0'  
        reservations:
          memory: 32G  
          cpus: '8.0'  

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.4.17
    command: ["milvus", "run", "standalone"]
    security_opt:
    - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - /docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/milvus.yaml:/milvus/configs/milvus.yaml
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
      - /docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/milvus_log:/docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/milvus_log
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

networks:
  default:
    name: milvus

I tried setting the CPU cores to 8 and the memory to 32G and got the same result.
I attach the etcd log and the standalone log (in error mode):
standalone-0.txt
full_log_include_etcd.txt

@yanliang567
Contributor

The etcd service must run on SSD volumes, because Milvus uses etcd as its heartbeat service. If etcd is slow, Milvus fails to keep itself healthy and restarts.
According to the etcd logs, I think it is too slow, as this message indicates:
milvus-etcd | {"level":"warn","ts":"2024-12-24T02:46:45.659Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"3.466976856s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"health\" ","response":"range_response_count:0 size:5"}
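
To get a sense of how widespread the slowness is, here is a minimal sketch (assuming the attached full_log_include_etcd.txt is docker-compose output in which etcd lines carry a "milvus-etcd |" prefix followed by JSON, as in the message above) that counts the "apply request took too long" warnings and reports the worst one:

import json
import re

slow_seconds = []
with open("full_log_include_etcd.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        # Strip the "milvus-etcd | " prefix that docker-compose adds, if present.
        payload = line.split("|", 1)[-1].strip()
        if not payload.startswith("{"):
            continue  # skip non-JSON lines (e.g. Milvus' own log format)
        try:
            entry = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if entry.get("msg") == "apply request took too long":
            m = re.match(r"([\d.]+)(ms|s)", entry.get("took", ""))
            if m:
                value, unit = float(m.group(1)), m.group(2)
                slow_seconds.append(value if unit == "s" else value / 1000)

print(f"{len(slow_seconds)} slow apply requests, "
      f"worst {max(slow_seconds, default=0):.2f}s (etcd expects ~0.1s)")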

@xwt1 xwt1 closed this as completed Dec 25, 2024