
feat: changes for Bare metal AI tier release #79

Draft
kupratyu-splunk wants to merge 14 commits into main from ai-tier-v2

Conversation

@kupratyu-splunk

Description

Related Issues

  • Related to #

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test improvement
  • CI/CD improvement
  • Chore (dependency updates, etc.)

Changes Made

Testing Performed

  • Unit tests pass (make test)
  • Linting passes (make lint)
  • Integration tests pass (if applicable)
  • E2E tests pass (if applicable)
  • Manual testing performed

Test Environment

  • Kubernetes Version:
  • Cloud Provider:
  • Deployment Method:

Test Steps

Documentation

  • Updated inline code comments
  • Updated README.md (if adding features)
  • Updated API documentation
  • Updated deployment guides
  • Updated CHANGELOG.md
  • No documentation needed

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published
  • I have updated the Helm chart version (if applicable)
  • I have updated CRD schemas (if applicable)

Breaking Changes

Impact:

Migration Path:

Screenshots/Recordings

Additional Notes

Reviewer Notes

Please pay special attention to:


Commit Message Convention: This PR follows Conventional Commits

Version 3.0.0 does not exist in the splunk helm repo; 3.1.0 is the
latest available. Also regenerates Chart.lock with correct digest.

The splunkai_models_apps package no longer exists in ai-platform-models.
The Ray applications are now resolved relative to their working_dir zip,
so import paths should be bare module names (main:SERVE_APP / main:create_serve_app).
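A bare-module import path of this kind can be sketched as follows. The dict layout, zip URL, and the `import_module_name` helper are illustrative, not the project's actual config:

```python
# Sketch of a Ray Serve application entry using a bare-module import path,
# resolved relative to the working_dir zip at runtime. The zip URL and the
# dict layout are illustrative, not the project's actual config.
app_config = {
    "name": "Entrypoint",
    "import_path": "main:create_serve_app",  # bare module name, no package prefix
    "runtime_env": {
        "working_dir": "s3://models/Entrypoint-1.0.0.zip",
    },
}

def import_module_name(app):
    """Return the module portion of an import path ('main' here)."""
    return app["import_path"].split(":", 1)[0]
```

Since `main` lives at the root of the working_dir zip, no package prefix such as `splunkai_models_apps.` is needed anymore.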

…ersion into ApplicationParams

Without working_dir, Ray has no zip to load main from and fails with
'No module named main'. Added WorkingDirBase and ModelVersion fields to
ApplicationParams, computed from object storage path and MODEL_VERSION
env var, and templated working_dir into all 13 app entries in applications.yaml.
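The templating described above might look like this in miniature; `render_working_dir` is a hypothetical Python helper mirroring the `{{.WorkingDirBase}}/AppName-{{.ModelVersion}}.zip` template, not code from the repo:

```python
import os

def render_working_dir(working_dir_base, app_name, model_version=None):
    # Mirrors the {{.WorkingDirBase}}/AppName-{{.ModelVersion}}.zip template;
    # falls back to the MODEL_VERSION env var as the commit message describes.
    version = model_version or os.environ.get("MODEL_VERSION", "latest")
    return f"{working_dir_base}/{app_name}-{version}.zip"
```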

…b_storage

Two bugs causing NoSuchBucket when Ray downloads working_dir zips:

1. rayS3DownloadEnv() was missing AWS_S3_ADDRESSING_STYLE=path. Boto3
   defaults to virtual-hosted style (bucket.endpoint) for custom endpoints,
   which fails DNS resolution with MinIO. Path-style (endpoint/bucket/key)
   is required for all S3-compatible stores.

2. applications.yaml used 'object_storage' as the model_loader sub-field but
   ModelLoader in model_definition.py defines it as 'blob_storage' (renamed
   in commit e62d93da). Pydantic silently ignored the unknown key, leaving
   blob_storage=None and causing a model validation error at startup.
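Both failure modes can be sketched in plain Python. The helper names and the dataclass stand-in are illustrative; the real model is a pydantic class in model_definition.py:

```python
from dataclasses import dataclass, fields
from typing import Optional

# Bug 1: addressing style. Virtual-hosted URLs bake the bucket into the
# hostname, which has no DNS entry for a single-host MinIO deployment;
# path-style keeps one endpoint and appends bucket/key to the path.
def s3_object_url(endpoint, bucket, key, addressing_style="path"):
    scheme, host = endpoint.split("://", 1)
    if addressing_style == "virtual":
        return f"{scheme}://{bucket}.{host}/{key}"   # bucket.endpoint/key
    return f"{endpoint}/{bucket}/{key}"              # endpoint/bucket/key

# Bug 2: silently ignored unknown key. Stand-in for the pydantic model
# (pydantic's default config is extra='ignore', so unexpected fields
# are simply dropped rather than raising an error).
@dataclass
class ModelLoader:
    blob_storage: Optional[str] = None  # renamed from object_storage

def load_ignoring_extras(raw):
    known = {f.name for f in fields(ModelLoader)}
    return ModelLoader(**{k: v for k, v in raw.items() if k in known})

# The stale key from applications.yaml leaves blob_storage unset:
loader = load_ignoring_extras({"object_storage": "s3://models"})
```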

…handler

Ray's s3:// protocol handler (protocol.py _handle_s3_protocol) creates a
plain boto3.Session().client('s3') with no endpoint_url, so it always hits
AWS S3 regardless of AWS_ENDPOINT_URL set on the pod. This causes NoSuchBucket
when the bucket only exists in MinIO.

Replace rayRuntimeWorkingDirScheme() with rayWorkingDirBase() which, for
S3-compatible stores with a custom endpoint, builds the working_dir as a
direct HTTP URL to MinIO (endpoint/bucket/path). Ray's https handler uses
urllib which simply fetches the URL without any S3-specific boto3 logic.

Also remove the ineffective AWS_S3_ADDRESSING_STYLE env var added in the
previous commit.
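The endpoint-aware base-URL logic can be sketched as follows; this is a Python rendition of the rayWorkingDirBase() idea (the real implementation is in builder.go, and the signature here is an assumption):

```python
from typing import Optional

def working_dir_base(endpoint_url: Optional[str], scheme: str,
                     bucket: str, prefix: str) -> str:
    # With a custom S3-compatible endpoint (MinIO), emit a direct HTTP(S)
    # URL so Ray's urllib-based handler fetches it without any boto3 logic;
    # otherwise keep the native scheme (s3://, gs://, ...).
    if endpoint_url:
        return f"{endpoint_url.rstrip('/')}/{bucket}/{prefix}"
    return f"{scheme}://{bucket}/{prefix}"
```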

…nIO zips

Ray's s3:// protocol handler creates a bare boto3.Session().client('s3')
with no endpoint_url, so it always hits AWS S3 regardless of any custom
endpoint config. Rather than fighting Ray internals, switch to file://
working_dir pointing to app source baked into the Ray image.

- applications.yaml: replace all 'minio-zip' working_dir templates with
  file:///home/ray/ray/applications/entrypoint (Entrypoint) and
  file:///home/ray/ray/applications/generic_application (all other apps)
- builder.go: remove WorkingDirBase, ModelVersion fields and rayWorkingDirBase()
  function — no longer needed since working_dir is a static file:// path
- builder_test.go: remove TestRayWorkingDirBase test for deleted function

…ote URL for others

PromptInjectionTfidf, PromptInjectionCrossEncoder, PromptInjectionClassifier are
baked into the Ray worker image at /home/ray/ray/applications/generic_application,
so they use file:// working_dir with no network dependency.

All other apps (UaeLarge, AllMinilmL6V2, BiEncoder, MbartTranslator, etc.) continue
to use {{.WorkingDirBase}}/AppName-{{.ModelVersion}}.zip resolved at runtime from
the configured object storage (s3, gs, azure, or s3compat/MinIO endpoint).

…cile

- saia/impl.go: bump default memory request 1Gi->2Gi, limits CPU 1->2 / memory 2Gi->4Gi
  to prevent kubelet OOMKill during SAIA startup
- reconciler.go: preserve existing AIService Resources on reconcile so user-set limits
  are not wiped back to defaults on every AIPlatform reconcile
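The reconcile rule can be sketched like this; `merge_resources` is a hypothetical Python rendition of the Go logic in reconciler.go and saia/impl.go, and the request CPU is omitted because the commit message does not state it:

```python
# Bumped defaults from the commit message: memory request 1Gi -> 2Gi,
# limits CPU 1 -> 2 and memory 2Gi -> 4Gi.
DEFAULT_RESOURCES = {
    "requests": {"memory": "2Gi"},
    "limits": {"cpu": "2", "memory": "4Gi"},
}

def merge_resources(existing, defaults=DEFAULT_RESOURCES):
    # Preserve user-set (or previously reconciled) Resources verbatim so
    # they are not wiped back to defaults on every AIPlatform reconcile.
    return existing if existing else defaults
```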

Ray requires file:// working_dir URIs to point to a .zip or .whl file.
Update the 3 prompt injection apps to reference generic_application.zip
which is built during the Docker image build in ai-platform-models.
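Under that constraint, a prompt injection entry might end up looking like this; the surrounding YAML structure is assumed, and only the working_dir value comes from the commit message:

```yaml
applications:
  - name: PromptInjectionTfidf
    runtime_env:
      working_dir: file:///home/ray/ray/applications/generic_application.zip
```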