[Bug]: [batch_insert fail] milvus down after I insert 10M random vector #38618

Closed
1 task done
xwt1 opened this issue Dec 20, 2024 · 4 comments
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.


@xwt1

xwt1 commented Dec 20, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.17 or 2.5.x
- Deployment mode(standalone or cluster): standalone with docker compose
- MQ type(rocksmq, pulsar or kafka):    rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.1 
- OS(Ubuntu or CentOS): Ubuntu 22.04.4 LTS
- CPU/Memory: Intel Xeon E5-2678 v3 CPU (2.50 GHz) with 128 GB memory 
- GPU: 
- Others:

Current Behavior

When I insert 10M random vectors into standalone Milvus and then build an index, the instance randomly breaks down.

I basically generate 10M random vectors, insert them into a collection in batches, and then index them. But when I execute the code, the standalone instance goes down.

The code I ran is roughly:

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
import numpy as np
import time
import sys

# Connect to the standalone Milvus instance.
connections.connect(alias="default", host="127.0.0.1", port="19530")

# Generate 10M random 128-dim float vectors and sequential int64 IDs.
num_vectors = 10000000
vector_dim = 128
vectors = np.random.random((num_vectors, vector_dim)).astype(np.float32)
ids = [i for i in range(num_vectors)]

# Schema: an int64 primary key plus a float vector field.
id_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False)
vector_field = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=vector_dim)
schema = CollectionSchema(
    fields=[id_field, vector_field],
    description="Collection with HNSW index variations"
)

# Four HNSW configurations to compare.
hnsw_configs = [
    {"index_type": "HNSW", "params": {"M": 8, "efConstruction": 100}, "metric_type": "L2"},
    {"index_type": "HNSW", "params": {"M": 16, "efConstruction": 200}, "metric_type": "L2"},
    {"index_type": "HNSW", "params": {"M": 32, "efConstruction": 500}, "metric_type": "L2"},
    {"index_type": "HNSW", "params": {"M": 48, "efConstruction": 600}, "metric_type": "L2"}
]

for i, index_config in enumerate(hnsw_configs):
    collection_name = f"hnsw_dataset_{i}"

    collection = Collection(name=collection_name, schema=schema)

    # Insert the 10M vectors in batches of 10k.
    batch_size = 10000
    for start_idx in range(0, len(ids), batch_size):
        end_idx = min(start_idx + batch_size, len(ids))
        batch_ids = ids[start_idx:end_idx]
        batch_vectors = vectors[start_idx:end_idx]
        collection.insert([batch_ids, batch_vectors])
        sys.stdout.flush()

    # Build the HNSW index and time it.
    sys.stdout.flush()
    start_time = time.time()
    collection.create_index(field_name="embedding", index_params=index_config, index_name=f"index_{i}")
    end_time = time.time()
    index_time = end_time - start_time
    sys.stdout.flush()

    collection.flush()

    collection.release()

I don't know whether the reason is that I did not allocate enough resources to etcd. I checked milvus.log (in error mode) and saw messages like "failed to save by batch" and "error="etcdserver: request timed out"". If that is the cause, how can I fix it?

Expected Behavior

No Breakdown

Steps To Reproduce

Milvus Log

standalone-0.txt

Anything else?

No response

@xwt1 xwt1 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 20, 2024
@yanliang567
Contributor

@xwt1 please check:

  1. whether the etcd service is running on an SSD volume, for high performance
  2. how many CPU cores you set for the etcd and milvus pods
  3. please upload the etcd logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @xwt1
/unassign

@sre-ci-robot sre-ci-robot assigned xwt1 and unassigned yanliang567 Dec 20, 2024
@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 20, 2024
@xwt1
Author

xwt1 commented Dec 20, 2024

@xwt1 please check:

  1. whether the etcd service is running on an SSD volume, for high performance
  2. how many CPU cores you set for the etcd and milvus pods
  3. please upload the etcd logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @xwt1 /unassign

@yanliang567
Thanks for your reply!

  1. For point one, I actually run Milvus entirely on an HDD. Does etcd have to run on an SSD, or will it just not work on an HDD? When I shrink the vector count to 1M or fewer, my code works. (See the fsync-latency sketch below.)
  2. For point two, I don't know how to set different CPU core counts in docker-compose.yml or milvus.yaml, so I think it is at the defaults (I don't know the default value for etcd and I can't find the information in the Milvus docs -_-).
  3. Here is the full log for etcd:
    full_log_include_etcd.txt

BTW, I don't really care about performance for now, so running on an HDD with slow performance is acceptable. :(
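
For reference, a minimal sketch (not part of the original comment) of the kind of disk check point 1 is getting at: it measures fsync latency on the host directory backing the etcd volume. The path below is hypothetical; substitute whatever directory is mounted at /etcd in docker-compose.yml. etcd's hardware guidance expects WAL fsync latency of only a few milliseconds, which an HDD usually cannot sustain.

import os
import time

DATA_DIR = "./volumes/etcd"  # hypothetical: the host path mounted at /etcd

def worst_fsync_latency_ms(path: str, rounds: int = 100) -> float:
    """Write small records and fsync each one, returning the worst latency in ms."""
    probe = os.path.join(path, "fsync_probe.tmp")
    worst = 0.0
    with open(probe, "wb") as f:
        for _ in range(rounds):
            f.write(os.urandom(4096))   # small write, roughly a WAL-entry-sized record
            start = time.perf_counter()
            f.flush()
            os.fsync(f.fileno())        # force the data onto stable storage
            worst = max(worst, (time.perf_counter() - start) * 1000)
    os.remove(probe)
    return worst

print(f"worst fsync latency: {worst_fsync_latency_ms(DATA_DIR):.1f} ms")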

@xwt1
Author

xwt1 commented Dec 24, 2024

@yanliang567

version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - /docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/etcd_log:/docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/etcd_log
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 110G  
          cpus: '16.0'  
        reservations:
          memory: 32G  
          cpus: '8.0'  

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.4.17
    command: ["milvus", "run", "standalone"]
    security_opt:
    - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - /docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/milvus.yaml:/milvus/configs/milvus.yaml
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
      - /docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/milvus_log:/docker-home/xwt/CAVDTUNER/milvus-standalone-test-2.4.x/logs/milvus_log
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

networks:
  default:
    name: milvus

I tried setting the CPU cores to 8 and the memory to 32G and got the same result.
I attach the etcd log and the standalone log (in error mode):
standalone-0.txt
full_log_include_etcd.txt

@yanliang567
Contributor

The etcd service must run on SSD volumes, because Milvus uses etcd as its heartbeat service. If etcd is slow, Milvus fails to keep itself healthy and restarts.
According to the etcd logs, I think it is too slow, as this message indicates:
milvus-etcd | {"level":"warn","ts":"2024-12-24T02:46:45.659Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"3.466976856s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"health\" ","response":"range_response_count:0 size:5"}
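
To get a sense of how widespread the slowness is, here is a minimal sketch (assuming the attached full_log_include_etcd.txt is docker-compose output in which etcd lines carry a "milvus-etcd |" prefix followed by JSON, as in the message above) that counts the "apply request took too long" warnings and reports the worst one:

import json
import re

slow_seconds = []
with open("full_log_include_etcd.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        # Strip the "milvus-etcd | " prefix that docker-compose adds, if present.
        payload = line.split("|", 1)[-1].strip()
        if not payload.startswith("{"):
            continue  # skip non-JSON lines (e.g. Milvus' own log format)
        try:
            entry = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if entry.get("msg") == "apply request took too long":
            m = re.match(r"([\d.]+)(ms|s)", entry.get("took", ""))
            if m:
                value, unit = float(m.group(1)), m.group(2)
                slow_seconds.append(value if unit == "s" else value / 1000)

print(f"{len(slow_seconds)} slow apply requests, "
      f"worst {max(slow_seconds, default=0):.2f}s (etcd expects ~0.1s)")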

@xwt1 xwt1 closed this as completed Dec 25, 2024