
feat[frontend]: implement artifact-repositories configmap support #11354

Status: Open. Wants to merge 1 commit into base: master.

Conversation


@droctothorpe droctothorpe commented Nov 5, 2024

Description of your changes:

Argo Workflows supports specifying an artifact repository for archived logs per namespace, rather than only globally, as documented here.

This PR adds support for this functionality to the KFP UI. If a workflow runs and...

  1. The corresponding pods are deleted
  2. The corresponding workflow is deleted
  3. There is an artifact-repositories configmap in the target namespace that provides a unique, namespaced keyFormat

...this PR ensures that logs are still accessible in the UI. This functionality is disabled by default to avoid unnecessary network calls (for end users who don't leverage namespace-specific artifact-repositories). It can be toggled on through configs.ts or an environment variable.
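As a rough sketch of what an env-var toggle like this can look like (the helper below is illustrative and is not the PR's actual configs.ts code; ARGO_ARTIFACT_REPOSITORIES_LOOKUP is the variable name that appears later in this thread):

```typescript
// Illustrative feature-flag helper: parse a boolean from an environment
// variable, defaulting to false so the lookup stays off unless enabled.
function parseBooleanFlag(value: string | undefined, defaultValue = false): boolean {
  if (value === undefined || value === '') return defaultValue;
  return value.toLowerCase() === 'true';
}

// Off by default; enabled by setting ARGO_ARTIFACT_REPOSITORIES_LOOKUP=true.
const artifactRepositoriesLookup = parseBooleanFlag(
  process.env.ARGO_ARTIFACT_REPOSITORIES_LOOKUP,
);
```

Defaulting to false matches the PR's stated goal of avoiding unnecessary network calls for users who don't use namespace-scoped artifact repositories.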

Co-authored with @quinnovator 🌟 .

Fixes: #11339

Checklist:


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zijianjoy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: droctothorpe <[email protected]>
Co-authored-by: quinnovator <[email protected]>
@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Nov 5, 2024
@github-actions github-actions bot added the ci-passed All CI tests on a pull request have passed label Nov 5, 2024
@HumairAK HumairAK self-assigned this Nov 6, 2024

droctothorpe commented Nov 6, 2024

@HumairAK to recreate the edge case:

  1. Deploy KFP to a local K8s cluster.
  2. Set ARGO_ARCHIVE_LOGS=true in configs.ts.
  3. cd into the frontend directory.
  4. Run NAMESPACE=kubeflow npm run start:proxy-and-server to start the UI on your host machine.
  5. Create the following configmap in the kubeflow namespace:
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    workflows.argoproj.io/default-artifact-repository: artifact-repositories
  name: artifact-repositories
  namespace: kubeflow
data:
  artifact-repositories: |-
    archiveLogs: true
    s3:
      accessKeySecret:
        key: accesskey
        name: mlpipeline-minio-artifact
      bucket: mlpipeline
      endpoint: minio-service.kubeflow:9000
      insecure: true
      keyFormat: foo
      secretKeySecret:
        key: secretkey
        name: mlpipeline-minio-artifact
  6. Run a simple pipeline.
  7. Once it completes, delete the AWF workflow custom resource associated with the run.
  8. Try to access the logs of the run in the UI running on your host machine. On master, this operation fails; on this branch, it succeeds.
  9. The next few steps are optional. Port-forward the minio pod.
  10. Navigate to the minio UI.
  11. Log in (username: minio, password: minio123).
  12. Note how the logs are stored in mlpipeline/foo rather than under the default keyFormat specified in configs.ts. Argo gives precedence to the keyFormat in the artifact-repositories configmap over the keyFormat in workflow-controller-configmap.
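The keyFormat behavior in the last step can be illustrated with a small sketch. Argo expands {{...}} placeholders in keyFormat when writing objects; the render function below is purely illustrative (not Argo's or KFP's actual code), but it shows why a keyFormat of foo puts the logs at mlpipeline/foo while the default keyFormat scatters them under per-workflow paths:

```typescript
// Illustrative expansion of Argo-style keyFormat placeholders such as
// {{workflow.name}} or {{pod.name}} into a concrete object-store key.
function renderKeyFormat(keyFormat: string, vars: Record<string, string>): string {
  return keyFormat.replace(/\{\{([^}]+)\}\}/g, (_match, name: string) => {
    const value = vars[name.trim()];
    if (value === undefined) throw new Error(`unresolved placeholder: ${name}`);
    return value;
  });
}

// A literal keyFormat like "foo" contains no placeholders, so it renders
// to the same key regardless of workflow or pod.
renderKeyFormat('foo', {}); // -> 'foo'

// A templated keyFormat resolves per workflow/pod.
renderKeyFormat('artifacts/{{workflow.name}}/{{pod.name}}', {
  'workflow.name': 'my-run',
  'pod.name': 'my-run-pod-1',
}); // -> 'artifacts/my-run/my-run-pod-1'
```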


droctothorpe commented Nov 8, 2024

I updated the instructions for reproducing the edge case to make manual validation much easier by running the UI on your host machine.


HumairAK commented Nov 13, 2024

Hey @droctothorpe /@quinnovator, thanks for this. I'm trying this out now and encountering some hiccups. I suspect it might be my environment/setup, but I want to confirm with you first.

  1. When building the image and trying this out with a pod deployment of the UI, I'm not seeing it work. What I see:

[screenshot]

The UI deployment YAML is as posted below:

deployment.yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"ml-pipeline-ui","application-crd-id":"kubeflow-pipelines"},"name":"ml-pipeline-ui","namespace":"kubeflow"},"spec":{"selector":{"matchLabels":{"app":"ml-pipeline-ui","application-crd-id":"kubeflow-pipelines"}},"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"},"labels":{"app":"ml-pipeline-ui","application-crd-id":"kubeflow-pipelines"}},"spec":{"containers":[{"env":[{"name":"DISABLE_GKE_METADATA","value":"true"},{"name":"VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH","value":"/etc/config/viewer-pod-template.json"},{"name":"MINIO_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}},{"name":"MINIO_ACCESS_KEY","valueFrom":{"secretKeyRef":{"key":"accesskey","name":"mlpipeline-minio-artifact"}}},{"name":"MINIO_SECRET_KEY","valueFrom":{"secretKeyRef":{"key":"secretkey","name":"mlpipeline-minio-artifact"}}},{"name":"ALLOW_CUSTOM_VISUALIZATIONS","value":"true"},{"name":"FRONTEND_SERVER_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"image":"gcr.io/ml-pipeline/frontend:2.3.0","imagePullPolicy":"IfNotPresent","livenessProbe":{"exec":{"command":["wget","-q","-S","-O","-","http://localhost:3000/apis/v1beta1/healthz"]},"initialDelaySeconds":3,"periodSeconds":5,"timeoutSeconds":2},"name":"ml-pipeline-ui","ports":[{"containerPort":3000}],"readinessProbe":{"exec":{"command":["wget","-q","-S","-O","-","http://localhost:3000/apis/v1beta1/healthz"]},"initialDelaySeconds":3,"periodSeconds":5,"timeoutSeconds":2},"resources":{"requests":{"cpu":"10m","memory":"70Mi"}},"volumeMounts":[{"mountPath":"/etc/config","name":"config-volume","readOnly":true}]}],"serviceAccountName":"ml-pipeline-ui","volumes":[{"configMap":{"name":"ml-pipeline-ui-configmap"},"name":"config-volume"}]}}}}
  resourceVersion: '627852'
  name: ml-pipeline-ui
  uid: 4d454d73-05f6-4b01-b744-8a4c9aec05ec
  creationTimestamp: '2024-11-12T22:37:37Z'
  generation: 8
  namespace: kubeflow
  labels:
    app: ml-pipeline-ui
    application-crd-id: kubeflow-pipelines
spec:
  replicas: 0
  selector:
    matchLabels:
      app: ml-pipeline-ui
      application-crd-id: kubeflow-pipelines
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ml-pipeline-ui
        application-crd-id: kubeflow-pipelines
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    spec:
      restartPolicy: Always
      serviceAccountName: ml-pipeline-ui
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources:
            requests:
              cpu: 10m
              memory: 70Mi
          readinessProbe:
            exec:
              command:
                - wget
                - '-q'
                - '-S'
                - '-O'
                - '-'
                - 'http://localhost:3000/apis/v1beta1/healthz'
            initialDelaySeconds: 3
            timeoutSeconds: 2
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: ml-pipeline-ui
          livenessProbe:
            exec:
              command:
                - wget
                - '-q'
                - '-S'
                - '-O'
                - '-'
                - 'http://localhost:3000/apis/v1beta1/healthz'
            initialDelaySeconds: 3
            timeoutSeconds: 2
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: DISABLE_GKE_METADATA
              value: 'true'
            - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH
              value: /etc/config/viewer-pod-template.json
            - name: MINIO_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: MINIO_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: mlpipeline-minio-artifact
                  key: accesskey
            - name: MINIO_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: mlpipeline-minio-artifact
                  key: secretkey
            - name: ALLOW_CUSTOM_VISUALIZATIONS
              value: 'true'
            - name: FRONTEND_SERVER_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: ARGO_ARTIFACT_REPOSITORIES_LOOKUP
              value: 'true'
            - name: ARGO_ARCHIVE_LOGS
              value: 'true'
          ports:
            - containerPort: 3000
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: config-volume
              readOnly: true
              mountPath: /etc/config
          terminationMessagePolicy: File
          image: 'quay.io/hukhan/ds-pipelines-frontend:11354'
      serviceAccount: ml-pipeline-ui
      volumes:
        - name: config-volume
          configMap:
            name: ml-pipeline-ui-configmap
            defaultMode: 420
      dnsPolicy: ClusterFirst
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 8
  conditions:
    - type: Progressing
      status: 'True'
      lastUpdateTime: '2024-11-13T04:36:07Z'
      lastTransitionTime: '2024-11-12T22:37:37Z'
      reason: NewReplicaSetAvailable
      message: ReplicaSet "ml-pipeline-ui-5c5479dc67" has successfully progressed.
    - type: Available
      status: 'True'
      lastUpdateTime: '2024-11-13T22:54:29Z'
      lastTransitionTime: '2024-11-13T22:54:29Z'
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.

And the artifact repository configmap:

kind: ConfigMap
apiVersion: v1
metadata:
  name: artifact-repositories
  namespace: kubeflow
data:
  artifact-repositories: |-
    archiveLogs: true
    s3:
      accessKeySecret:
        key: accesskey
        name: mlpipeline-minio-artifact
      bucket: mlpipeline
      endpoint: minio-service.kubeflow:9000
      insecure: true
      keyFormat: custom_location/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}
      secretKeySecret:
        key: secretkey
        name: mlpipeline-minio-artifact

As you can see I have set ARGO_ARTIFACT_REPOSITORIES_LOOKUP and ARGO_ARCHIVE_LOGS to true. I can also confirm it's got the PR changes by viewing the UI pod container logs, the UI config has:

{
  argo: {
    archiveArtifactory: 'minio',
    archiveBucketName: 'mlpipeline',
    archiveLogs: true,
    keyFormat: 'artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}',
    artifactRepositoriesLookup: true

^^^^^ PR change (excerpt truncated here)
  2. When trying this locally, I see some strange behavior where one of the task nodes fetches logs from the object store but the other does not. In some scenarios both fail. My suspicion is that some faulty timeout behavior in my connection to the object store is to blame, but I'm wondering if you encounter this at all:

[screenshot: logs-not-pulled]

The pipeline used:

pipeline.yaml
# PIPELINE DEFINITION
# Name: tutorial-data-passing-2
# Inputs:
#    message: str [Default: 'message']
components:
  comp-preprocess:
    executorLabel: exec-preprocess
    inputDefinitions:
      parameters:
        message:
          parameterType: STRING
    outputDefinitions:
      artifacts:
        output_dataset_one:
          artifactType:
            schemaTitle: system.Dataset
            schemaVersion: 0.0.1
        output_dataset_two_path:
          artifactType:
            schemaTitle: system.Dataset
            schemaVersion: 0.0.1
      parameters:
        output_bool_parameter_path:
          parameterType: BOOLEAN
        output_dict_parameter_path:
          parameterType: STRUCT
        output_list_parameter_path:
          parameterType: LIST
        output_parameter_path:
          parameterType: STRING
  comp-train:
    executorLabel: exec-train
    inputDefinitions:
      artifacts:
        dataset_one_path:
          artifactType:
            schemaTitle: system.Dataset
            schemaVersion: 0.0.1
        dataset_two:
          artifactType:
            schemaTitle: system.Dataset
            schemaVersion: 0.0.1
      parameters:
        input_bool:
          parameterType: BOOLEAN
        input_dict:
          parameterType: STRUCT
        input_list:
          parameterType: LIST
        message:
          parameterType: STRING
        num_steps:
          defaultValue: 100.0
          isOptional: true
          parameterType: NUMBER_INTEGER
    outputDefinitions:
      artifacts:
        model:
          artifactType:
            schemaTitle: system.Model
            schemaVersion: 0.0.1
deploymentSpec:
  executors:
    exec-preprocess:
      container:
        args:
        - --executor_input
        - '{{$}}'
        - --function_to_execute
        - preprocess
        command:
        - sh
        - -c
        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip ||\
          \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
          \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.3.0'\
          \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
          $0\" \"$@\"\n"
        - sh
        - -ec
        - 'program_path=$(mktemp -d)


          printf "%s" "$0" > "$program_path/ephemeral_component.py"

          _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main                         --component_module_path                         "$program_path/ephemeral_component.py"                         "$@"

          '
        - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
          \ *\n\ndef preprocess(\n    # An input parameter of type string.\n    message:\
          \ str,\n    # Use Output[T] to get a metadata-rich handle to the output\
          \ artifact\n    # of type `Dataset`.\n    output_dataset_one: Output[Dataset],\n\
          \    # A locally accessible filepath for another output artifact of type\n\
          \    # `Dataset`.\n    output_dataset_two_path: OutputPath('Dataset'),\n\
          \    # A locally accessible filepath for an output parameter of type string.\n\
          \    output_parameter_path: OutputPath(str),\n    # A locally accessible\
          \ filepath for an output parameter of type bool.\n    output_bool_parameter_path:\
          \ OutputPath(bool),\n    # A locally accessible filepath for an output parameter\
          \ of type dict.\n    output_dict_parameter_path: OutputPath(Dict[str, int]),\n\
          \    # A locally accessible filepath for an output parameter of type list.\n\
          \    output_list_parameter_path: OutputPath(List[str]),\n):\n    \"\"\"\
          Dummy preprocessing step.\"\"\"\n\n    # Use Dataset.path to access a local\
          \ file path for writing.\n    # One can also use Dataset.uri to access the\
          \ actual URI file path.\n    with open(output_dataset_one.path, 'w') as\
          \ f:\n        f.write(message)\n\n    # OutputPath is used to just pass\
          \ the local file path of the output artifact\n    # to the function.\n \
          \   with open(output_dataset_two_path, 'w') as f:\n        f.write(message)\n\
          \n    with open(output_parameter_path, 'w') as f:\n        f.write(message)\n\
          \n    with open(output_bool_parameter_path, 'w') as f:\n        f.write(\n\
          \            str(True))  # use either `str()` or `json.dumps()` for bool\
          \ values.\n\n    import json\n    with open(output_dict_parameter_path,\
          \ 'w') as f:\n        f.write(json.dumps({'A': 1, 'B': 2}))\n\n    with\
          \ open(output_list_parameter_path, 'w') as f:\n        f.write(json.dumps(['a',\
          \ 'b', 'c']))\n\n"
        image: quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0
    exec-train:
      container:
        args:
        - --executor_input
        - '{{$}}'
        - --function_to_execute
        - train
        command:
        - sh
        - -c
        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip ||\
          \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
          \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.3.0'\
          \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
          $0\" \"$@\"\n"
        - sh
        - -ec
        - 'program_path=$(mktemp -d)


          printf "%s" "$0" > "$program_path/ephemeral_component.py"

          _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main                         --component_module_path                         "$program_path/ephemeral_component.py"                         "$@"

          '
        - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
          \ *\n\ndef train(\n    # Use InputPath to get a locally accessible path\
          \ for the input artifact\n    # of type `Dataset`.\n    dataset_one_path:\
          \ InputPath('Dataset'),\n    # Use Input[T] to get a metadata-rich handle\
          \ to the input artifact\n    # of type `Dataset`.\n    dataset_two: Input[Dataset],\n\
          \    # An input parameter of type string.\n    message: str,\n    # Use\
          \ Output[T] to get a metadata-rich handle to the output artifact\n    #\
          \ of type `Model`.\n    model: Output[Model],\n    # An input parameter\
          \ of type bool.\n    input_bool: bool,\n    # An input parameter of type\
          \ dict.\n    input_dict: Dict[str, int],\n    # An input parameter of type\
          \ List[str].\n    input_list: List[str],\n    # An input parameter of type\
          \ int with a default value.\n    num_steps: int = 100,\n):\n    \"\"\"Dummy\
          \ Training step.\"\"\"\n    with open(dataset_one_path, 'r') as input_file:\n\
          \        dataset_one_contents = input_file.read()\n\n    with open(dataset_two.path,\
          \ 'r') as input_file:\n        dataset_two_contents = input_file.read()\n\
          \    print(\"test3\")\n    line = (f'dataset_one_contents: {dataset_one_contents}\
          \ || '\n            f'dataset_two_contents: {dataset_two_contents} || '\n\
          \            f'message: {message} || '\n            f'input_bool: {input_bool},\
          \ type {type(input_bool)} || '\n            f'input_dict: {input_dict},\
          \ type {type(input_dict)} || '\n            f'input_list: {input_list},\
          \ type {type(input_list)} \\n')\n\n    with open(model.path, 'w') as output_file:\n\
          \        for i in range(num_steps):\n            output_file.write('Step\
          \ {}\\n{}\\n=====\\n'.format(i, line))\n\n    # model is an instance of\
          \ Model artifact, which has a .metadata dictionary\n    # to store arbitrary\
          \ metadata for the output artifact.\n    model.metadata['accuracy'] = 0.9\n\
          \n"
        image: quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0
pipelineInfo:
  name: tutorial-data-passing-2
root:
  dag:
    tasks:
      preprocess:
        cachingOptions:
          enableCache: true
        componentRef:
          name: comp-preprocess
        inputs:
          parameters:
            message:
              componentInputParameter: message
        taskInfo:
          name: preprocess
      train:
        cachingOptions:
          enableCache: true
        componentRef:
          name: comp-train
        dependentTasks:
        - preprocess
        inputs:
          artifacts:
            dataset_one_path:
              taskOutputArtifact:
                outputArtifactKey: output_dataset_one
                producerTask: preprocess
            dataset_two:
              taskOutputArtifact:
                outputArtifactKey: output_dataset_two_path
                producerTask: preprocess
          parameters:
            input_bool:
              taskOutputParameter:
                outputParameterKey: output_bool_parameter_path
                producerTask: preprocess
            input_dict:
              taskOutputParameter:
                outputParameterKey: output_dict_parameter_path
                producerTask: preprocess
            input_list:
              taskOutputParameter:
                outputParameterKey: output_list_parameter_path
                producerTask: preprocess
            message:
              taskOutputParameter:
                outputParameterKey: output_parameter_path
                producerTask: preprocess
        taskInfo:
          name: train
  inputDefinitions:
    parameters:
      message:
        defaultValue: message
        isOptional: true
        parameterType: STRING
schemaVersion: 2.1.0
sdkVersion: kfp-2.3.0

So, a couple of questions for you:

  1. Did you have this working with a built image running in a pod? If so, I suspect I have something configured incorrectly.
  2. Does the above pipeline work fine for you when you switch between the two tasks' pod logs (after deleting the underlying workflow)? (Or any other pipeline with more than one task node that produces logs.)

@droctothorpe

Thank you so much for taking the time to validate this, @HumairAK, and for being so thorough with documenting the issue that you encountered.

I'm going to try a couple of things to see if I can reproduce your results:

  1. Running the UI on host (using NAMESPACE=kubeflow npm run start:proxy-and-server), validate that I can toggle between multiple tasks.
  2. Running the UI on host, use your pipeline.
  3. Same as above but also use your configmap.
  4. Same as above but build and deploy the UI image instead of running it on the host.

Results:

  1. Running the UI on host, validate that I can toggle between multiple tasks.
    Worked. [screenshots]
  2. Running the UI on host, use your pipeline.
    Worked. [screenshots]
  3. Same as above but also use your configmap.

I think I see the problem. I did a comparison of the two configmaps:

[screenshot: diff of the two configmaps]

Your configmap is missing the workflows.argoproj.io/default-artifact-repository: artifact-repositories annotation. From the AWF documentation:
[screenshot: AWF documentation excerpt]
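The role of the annotation can be sketched as follows: Argo uses the workflows.argoproj.io/default-artifact-repository annotation to select which key in the configmap's data holds the default repository definition, so without it the lookup resolves nothing. The types and function below are illustrative only, not the PR's actual code:

```typescript
// Minimal shape of the parts of a ConfigMap this sketch needs.
interface ConfigMapLike {
  metadata: { annotations?: Record<string, string> };
  data: Record<string, string>;
}

const DEFAULT_REPO_ANNOTATION = 'workflows.argoproj.io/default-artifact-repository';

// Return the raw YAML string of the default artifact repository, if the
// annotation names a key present in the configmap's data; otherwise undefined.
function defaultRepositoryYaml(cm: ConfigMapLike): string | undefined {
  const key = cm.metadata.annotations?.[DEFAULT_REPO_ANNOTATION];
  // Without the annotation there is no default repository to resolve,
  // which matches the failure mode seen with the annotation-less configmap.
  if (!key) return undefined;
  return cm.data[key];
}
```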

Can you run the same test but make sure to include that annotation in your configmap?

Thank you! 🙏

Labels: ci-passed (All CI tests on a pull request have passed), size/L
Successfully merging this pull request may close these issues.

[frontend] The UI does not account for the artifact-repositories configmap