Contact Details [Optional]
No response
System Information
ZENML_LOCAL_VERSION: 0.61.0
ZENML_SERVER_VERSION: 0.61.0
ZENML_SERVER_DATABASE: sqlite
ZENML_SERVER_DEPLOYMENT_TYPE: other
ZENML_CONFIG_DIR: /home/raymund/.config/zenml
ZENML_LOCAL_STORE_DIR: /home/raymund/.config/zenml/local_stores
ZENML_SERVER_URL: http://192.168.0.100:8080
ZENML_ACTIVE_REPOSITORY_ROOT: /home/raymund/Documents/ftune-hq
PYTHON_VERSION: 3.10.14
ENVIRONMENT: native
SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '24.04'}
ACTIVE_WORKSPACE: default
ACTIVE_STACK: s3store_stack
ACTIVE_USER: lab
TELEMETRY_STATUS: enabled
ANALYTICS_CLIENT_ID: 5efcfb28-f5b0-44d7-be5f-7e06cf8909fc
ANALYTICS_USER_ID: 518c5278-9025-4d9d-87ec-a760547355c2
ANALYTICS_SERVER_ID: 37313583-62d5-4a91-8a12-209bbf468827
INTEGRATIONS: ['airflow', 'bitbucket', 'kaniko', 'pigeon', 'pytorch', 's3']
PACKAGES: {'certifi': '2024.7.4', 'fsspec': '2024.6.1', 's3fs': '2024.6.1', 'regex': '2024.5.15', 'tzdata': '2024.1', 'pytz': '2024.1',
'setuptools': '65.5.0', 'pip': '24.1.2', 'packaging': '24.1', 'attrs': '23.2.0', 'pyarrow': '16.1.0', 'rich': '13.7.1',
'nvidia-nvjitlink-cu12': '12.5.82', 'nvidia-cuda-nvrtc-cu12': '12.1.105', 'nvidia-cuda-cupti-cu12': '12.1.105', 'nvidia-nvtx-cu12':
'12.1.105', 'nvidia-cuda-runtime-cu12': '12.1.105', 'nvidia-cublas-cu12': '12.1.3.1', 'nvidia-cusparse-cu12': '12.1.0.106',
'nvidia-cusolver-cu12': '11.4.5.107', 'nvidia-cufft-cu12': '11.0.2.54', 'nvidia-curand-cu12': '10.3.2.106', 'ipython': '8.26.0',
'nvidia-cudnn-cu12': '8.9.2.26', 'ipywidgets': '8.1.3', 'click': '8.1.3', 'configparser': '7.0.0', 'docker': '6.1.3', 'multidict':
'6.0.5', 'pyyaml': '6.0.1', 'psutil': '6.0.0', 'traitlets': '5.14.3', 'decorator': '5.1.1', 'smmap': '5.0.1', 'tqdm': '4.66.4',
'transformers': '4.42.3', 'typing-extensions': '4.12.2', 'pexpect': '4.9.0', 'widgetsnbextension': '4.0.11', 'gitdb': '4.0.11',
'async-timeout': '4.0.3', 'bcrypt': '4.0.1', 'filelock': '3.15.4', 'aiohttp': '3.9.5', 'idna': '3.7', 'xxhash': '3.4.1',
'charset-normalizer': '3.3.2', 'networkx': '3.3', 'gitpython': '3.1.43', 'jinja2': '3.1.4', 'prompt-toolkit': '3.0.47',
'jupyterlab-widgets': '3.0.11', 'greenlet': '3.0.3', 'markdown-it-py': '3.0.0', 'requests': '2.31.0', 'nvidia-nccl-cu12': '2.20.5',
'datasets': '2.19.1', 'pydantic-core': '2.18.4', 'pygments': '2.18.0', 'aiobotocore': '2.13.1', 'python-dateutil': '2.9.0.post0',
'pydantic': '2.7.4', 'pyparsing': '2.4.7', 'asttokens': '2.4.1', 'torch': '2.3.1', 'triton': '2.3.1', 'pandas': '2.2.2', 'urllib3':
'2.2.2', 'pydantic-settings': '2.2.1', 'cloudpickle': '2.2.1', 'markupsafe': '2.1.5', 'sqlalchemy': '2.0.31', 'executing': '2.0.1',
'boto3': '1.34.131', 'botocore': '1.34.131', 'numpy': '1.26.4', 'wrapt': '1.16.0', 'six': '1.16.0', 'sympy': '1.12.1', 'yarl': '1.9.4',
'distro': '1.9.0', 'alembic': '1.8.1', 'websocket-client': '1.8.0', 'passlib': '1.7.4', 'frozenlist': '1.4.1', 'argparse': '1.4.0',
'mako': '1.3.5', 'aiosignal': '1.3.1', 'mpmath': '1.3.0', 'exceptiongroup': '1.2.1', 'pymysql': '1.1.1', 'python-dotenv': '1.0.1',
'jmespath': '1.0.1', 'multiprocess': '0.70.16', 'zenml': '0.61.0', 'bitsandbytes': '0.43.1', 'sqlalchemy-utils': '0.41.2', 'accelerate':
'0.32.1', 'huggingface-hub': '0.23.4', 'jedi': '0.19.1', 'httplib2': '0.19.1', 'tokenizers': '0.19.1', 'validators': '0.18.2', 'peft':
'0.11.1', 'aioitertools': '0.11.0', 's3transfer': '0.10.2', 'parso': '0.8.4', 'aws-profile-manager': '0.7.3', 'ptyprocess': '0.7.0',
'annotated-types': '0.7.0', 'stack-data': '0.6.3', 'pyarrow-hotfix': '0.6', 'safetensors': '0.4.3', 'dill': '0.3.8', 'secure': '0.3.0',
'click-params': '0.3.0', 'wcwidth': '0.2.13', 'pure-eval': '0.2.2', 'comm': '0.2.2', 'matplotlib-inline': '0.1.7', 'mdurl': '0.1.2',
'sqlmodel': '0.0.18'}
CURRENT STACK
Name: s3store_stack
ID: f5a870ff-2c25-4e63-aa5b-4b90acaa9c64
User: lab / 518c5278-9025-4d9d-87ec-a760547355c2
Workspace: default / ec65075d-7856-48f9-a5ea-10eb3f063bac
ORCHESTRATOR: default
Name: default
ID: 5b8cf94c-0231-4860-9b7b-71d701cf866d
Type: orchestrator
Flavor: local
Configuration: {}
Workspace: default / ec65075d-7856-48f9-a5ea-10eb3f063bac
ARTIFACT_STORE: s3_store
Name: s3_store
ID: bee9c341-3cc6-48dc-b764-21f63b1924ea
Type: artifact_store
Flavor: s3
Configuration: {'authentication_secret': 's3_secret', 'path': 's3://labyrinth', 'key': '', 'secret': '', 'token':
'********', 'client_kwargs': {'endpoint_url': 'http://192.168.0.100:9000', 'region_name': 'taipei'}, 'config_kwargs': None,
's3_additional_kwargs': None}
User: lab / 518c5278-9025-4d9d-87ec-a760547355c2
Workspace: default / ec65075d-7856-48f9-a5ea-10eb3f063bac
What happened?
I set up an S3 artifact store in my stack and ran python run.py. After about 5 minutes, it failed with:

```
ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request
```
I checked Stack Overflow; people there said this error can be caused by token expiry.
Then I looked at the ZenML code and the underlying s3fs code. It seems that s3fs has a default 5-minute timeout for connections:

```
connect_timeout = 5
```
I checked my MinIO trace; the window is indeed 5 minutes.
I am considering adding that timeout parameter to the ZenML config, but I don't know how or where; a sketch of what I have in mind is below.
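For context, the artifact store configuration above shows a config_kwargs field (currently None). As far as I understand, s3fs forwards config_kwargs to botocore's client Config, which is where connect_timeout and read_timeout (in seconds) live. Here is a minimal, untested sketch at the plain s3fs level, assuming that pass-through; the 600-second values are arbitrary:

```python
import s3fs

# Untested sketch: s3fs forwards config_kwargs to botocore.client.Config,
# where connect_timeout/read_timeout are defined in seconds. Raising them
# may keep the connection window open past the default.
fs = s3fs.S3FileSystem(
    client_kwargs={
        "endpoint_url": "http://192.168.0.100:9000",
        "region_name": "taipei",
    },
    config_kwargs={"connect_timeout": 600, "read_timeout": 600},
)
fs.ls("labyrinth")  # sanity check against the MinIO bucket
```

If this behaves at the s3fs level, presumably the same dict could be supplied as the artifact store's config_kwargs when registering it, but I have not verified that.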
Since the error occurred after a long-running fine-tuning step, I wonder: maybe S3ArtifactStore should reconnect automatically instead of surfacing the 400 error? A rough sketch of the idea follows.
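To illustrate the auto-reconnect idea (purely a sketch, not ZenML's actual code; with_reconnect and the artifact path are hypothetical), one could rebuild the filesystem and retry once when a call fails with a 400:

```python
import s3fs
from botocore.exceptions import ClientError

def with_reconnect(make_fs, op, retries=1):
    # Hypothetical helper: run op(fs); on an HTTP 400, rebuild the
    # filesystem to drop the stale connection, then retry.
    fs = make_fs()
    for attempt in range(retries + 1):
        try:
            return op(fs)
        except ClientError as err:
            status = err.response.get("ResponseMetadata", {}).get("HTTPStatusCode")
            if status != 400 or attempt == retries:
                raise
            fs = make_fs()

info = with_reconnect(
    lambda: s3fs.S3FileSystem(
        client_kwargs={"endpoint_url": "http://192.168.0.100:9000"}
    ),
    lambda fs: fs.info("labyrinth/some-artifact"),  # placeholder path
)
```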
Hence this bug report.
Reproduction steps
```
zenml artifact-store register s3_store -f s3 --path='s3://labyrinth' --authentication_secret=s3_secret --client_kwargs='{"endpoint_url": "http://192.168.0.100:9000", "region_name": "taipei"}'
zenml stack register s3store_stack -a s3_store -o default --set
python run.py
```
Relevant log output
Code of Conduct