diff --git a/Research/kubeflow-on-azure-stack-lab/00-Intro/Readme.md b/Research/kubeflow-on-azure-stack-lab/00-Intro/Readme.md index 72d7ddf19..4cbaf4d68 100644 --- a/Research/kubeflow-on-azure-stack-lab/00-Intro/Readme.md +++ b/Research/kubeflow-on-azure-stack-lab/00-Intro/Readme.md @@ -74,7 +74,12 @@ The simpliest way to istall Kubeflow is to use a CNAP package. ## Step 1: Install Porter Make sure you have Porter installed. You can find the installation instructions for your OS at -Porter's [Installation Instructions](https://porter.sh/install/) +Porter's [Installation Instructions](https://porter.sh/install/). Latest version on Linux: + + $ curl https://cdn.porter.sh/latest/install-linux.sh | bash + Installing porter to /home/azureuser/.porter + Installed porter v0.29.0 (5e7240cf) + ... **NOTE:** be sure to add porter to your `PATH` variable so it can find the binaries @@ -82,27 +87,53 @@ Porter's [Installation Instructions](https://porter.sh/install/) First you will need to navigate to porter directory in the repository. For example + $ git clone https://github.com/Azure-Samples/azure-intelligent-edge-patterns.git + $ cd azure-intelligent-edge-patterns/Research/kubeflow-on-azure-stack/00-Intro $ cd porter/kubeflow -Change the file permissions +Change the file permissions if needed: - $ chmod 777 kubeflow.sh + $ chmod 755 kubeflow.sh -Next, you will build the porter CNAB +Build the porter CNAB like so: $ porter build + Copying porter runtime ===> + Copying mixins ===> + Copying mixin exec ===> + Copying mixin kubernetes ===> + Generating Dockerfile =======> + Writing Dockerfile =======> + Starting Invocation Image Build =======> + ## Step 3: Generate Credentials -This step is needed to connect to your Kubernetes cluster +This step is needed to connect to your Kubernetes cluster. An easy way to define the connection is to +point to the kubeconfig file. 
It is usually either in `/home/azureuser/.kube/config`, or you can find +and copy it from `/etc/kubernetes/admin.conf`. Here is the idiomatic way to do it: - $ porter credentials generate + $ mkdir -p $HOME/.kube + $ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config + $ sudo chown $(id -u):$(id -g) $HOME/.kube/config -Enter path to your kubeconfig file when prompted eg on master node of my cluster i gave +Alternatively, you can use the `KUBECONFIG` environment variable; if you are the root user, you can run: + + $ export KUBECONFIG=/etc/kubernetes/admin.conf + +To generate Porter credentials, pick the menu option 'from file' and enter your path `/home/azureuser/.kube/config`: + + $ porter credentials generate + ? How would you like to set credential "kubeconfig" + file path + ? Enter the path that will be used to set credential "kubeconfig" + /home/azureuser/.kube/config Validate that your credential is present by running the below command. You should see something like the below output. $ porter credentials list + NAME MODIFIED + KubeflowInstaller 40 seconds ago ![List Porter Credentials](porter/kubeflow/pics/porter-credentials-validate.png) @@ -113,13 +144,26 @@ Run one of the below commands to interact with the CNAB To Install : $ porter install --cred KubeflowInstaller + installing KubeflowInstaller... + executing install action from KubeflowInstaller (installation: KubeflowInstaller) + Install Kubeflow + Installing Kubeflow + [INFO] Installing kftctl binary for Kubeflow CLI... + ... Creating directory to store download + ... Downloading kfctl binary + ./kfctl + ... Creating Kubeflow directory + ... Installing Kubeflow for deployment: sandboxASkf + [DEBUG] /root/kubeflow//kfctl apply -V -f https://raw.githubusercontent.com/kubeflow/manifests/v1.1-branch/kfdef/kfctl_k8s_istio.v1.1.0.yaml + ... + ... + execution completed successfully! -To Upgrade : +The pods will start being created, and it will take several minutes, depending on the performance of your system.
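While waiting, you can poll the `kubeflow` namespace with a small loop instead of re-running `kubectl get pods` by hand. This is a sketch; it assumes `kubectl` is already configured for your cluster:

```shell
# keep polling while any pod still reports a transient state;
# exits as soon as nothing is starting up or erroring
while kubectl get pods -n kubeflow 2>/dev/null \
    | grep -Eq 'ContainerCreating|Init|PodInitializing|Error'; do
  echo "pods still settling, checking again in 30s..."
  sleep 30
done
echo "no pods left in a transient state"
```

Depending on what you consider transient, you may want to add `CrashLoopBackOff` to the pattern.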
+ +If you want to upgrade or uninstall Porter packages, you can use similar commands (do NOT run them right now): $ porter upgrade --cred KubeflowInstaller - -To Uninstall : - $ porter uninstall --cred KubeflowInstaller ## Step 5: Check for pods and services @@ -129,6 +173,26 @@ After the installation each of the services gets installed into its own namespac $ kubectl get pods -n kubeflow $ kubectl get svc -n kubeflow +Or, use the script we provide in the `sbin` folder to check until all pods are in the `Running` state (press `Ctrl-C` to stop the script +if no pods are in the `ContainerCreating`/`Init`/`Error` states anymore): + + $ cd azure-intelligent-edge-patterns/Research/kubeflow-on-azure-stack-lab/sbin + $ ./check_status.sh + NAME READY STATUS RESTARTS AGE + cache-deployer-deployment-b75f5c5f6-97fsb 0/2 Error 0 6m24s + cache-server-85bccd99bd-bkvww 0/2 Init:0/1 0 6m24s + kfserving-controller-manager-0 0/2 ContainerCreating 0 6m8s + metadata-db-695fb6f55-l6dgs 0/1 ContainerCreating 0 6m23s + ml-pipeline-persistenceagent-6f99b56974-mnt8l 0/2 PodInitializing 0 6m21s + Press Ctrl-C to stop... + NAME READY STATUS RESTARTS AGE + cache-server-85bccd99bd-bkvww 0/2 Init:0/1 0 7m24s + metadata-grpc-deployment-9fdb476-kszzl 0/1 CrashLoopBackOff 5 7m22s + Press Ctrl-C to stop... + NAME READY STATUS RESTARTS AGE + ^C + + ### Step 6: Opening Kubeflow dashboard To access the dashboard using external connection, replace "type: NodePort" with "type: LoadBalancer" using the patch command: @@ -162,6 +226,12 @@ let the pods create containers and start. --- In case CNAB package installation does not work, you can do it maually, see [Installing Kubeflow manually](installing_kubeflow_manually.md). +You would need to run the `kubeflow_install` script we provided, and follow the instructions.
At your Kubernetes master node: + + $ git clone https://github.com/Azure-Samples/azure-intelligent-edge-patterns.git + $ cd azure-intelligent-edge-patterns/Research/kubeflow-on-azure-stack/sbin + $ chmod 755 *.sh + $ ./kubeflow_install.sh We prepared the instructions to [Uninstalling Kubeflow](uninstalling_kubeflow.md) too in case you need to so so. diff --git a/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_aks-engine.md b/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_aks-engine.md index 171fdece8..477333653 100644 --- a/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_aks-engine.md +++ b/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_aks-engine.md @@ -7,7 +7,7 @@ Download `aks-engone` installation script if you do not have it already: Run the installer, specifying its version: - $ ./get-akse.sh --version v0.43.0 + $ ./get-akse.sh --version v0.55.4 If you have problems, please refer to the official page: [Install the AKS engine on Linux in Azure Stack](https://docs.microsoft.com/en-us/azure-stack/user/azure-stack-kubernetes-aks-engine-deploy-linux). 
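A quick way to confirm the upgrade took effect is to compare the installed version against the one you asked for. This is a sketch; `v0.55.4` matches the release pinned above:

```shell
# report whether the aks-engine on PATH matches the pinned release
EXPECTED="v0.55.4"
INSTALLED=$(aks-engine version 2>/dev/null | awk '/Version:/{print $2}')
if [ "$INSTALLED" = "$EXPECTED" ]; then
  echo "aks-engine $INSTALLED - OK"
else
  echo "found '${INSTALLED:-none}', expected $EXPECTED - re-run get-akse.sh" >&2
fi
```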
@@ -18,10 +18,15 @@ In the completely disconnected environment, you need to acquire the archive via Verify `aks-engine` version: $ aks-engine version - Version: v0.43.0 + Version: v0.55.4 GitCommit: 8928a4094 GitTreeState: clean +Copy the Azure Stack CA certificate into the system trust store with the following commands: + + $ sudo cp /var/lib/waagent/Certificates.pem /usr/local/share/ca-certificates/azurestackca.crt + $ sudo update-ca-certificates + # Links - [Azure/aks-engine](https://github.com/Azure/aks-engine) diff --git a/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubeflow_manually.md b/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubeflow_manually.md index 747d863fd..26846282c 100644 --- a/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubeflow_manually.md +++ b/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubeflow_manually.md @@ -17,8 +17,16 @@ the master node of your Kubernetes cluster: At your Kubernetes master node: $ git clone https://github.com/Azure-Samples/azure-intelligent-edge-patterns.git + +Make sure you cloned from the right repository and you are on the correct branch. + $ cd azure-intelligent-edge-patterns/Research/kubeflow-on-azure-stack/sbin +If for some reason the scripts are not executable (this happens with cross-platform git commits), +update the file permissions: + + $ chmod 755 *.sh + **IMPORTANT:** **Do NOT stop the script until it finishes. Some Kubernetes errors and warnings are expected @@ -80,8 +88,15 @@ become `Running` and the list will be empty: When the pods have been created, you can proceed. -To start using Kubeflow, you may want to make Kubeflow Dashboard be visible, so you will need -to change the type of the ingress behavior - from `NodePort` to `LoadBalancer`, using this +**IMPORTANT:** +To open the dashboard to a public IP address, you should first implement a solution to prevent unauthorized access.
You can read more about Azure authentication options from [Access Control for Azure Deployment](https://www.kubeflow.org/docs/azure/authentication/). + +For demo use, you can use port-forwarding to reach your cluster; run the following command and visit http://localhost:8080: + + $ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 + +Or, again, **for non-production deployment**, you can make the Kubeflow Dashboard externally visible by +changing the type of the ingress behavior - from `NodePort` to `LoadBalancer`, using this command (default editor is vi, to edit you need to press `i`, and to save and exit, `:wq`): $ ./edit_external_access.sh diff --git a/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubernetes.md b/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubernetes.md index 5978317f5..1046a73b7 100644 --- a/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubernetes.md +++ b/Research/kubeflow-on-azure-stack-lab/00-Intro/installing_kubernetes.md @@ -39,12 +39,12 @@ you will need to ask your cloud administrator. You need the following: Make sure you have all this information before proceeding further. -You can chose to create a Kubernetes object from your Portal, keeping -in mind the settings and adjustments we discuss below. +Even though you can create a Kubernetes cluster from your Portal, for Kubeflow we need to make a few +configuration changes, and it is easier to do so with AKS-e.
***Please continue to the next chapter and do not create a Kubernetes cluster using Portal.*** ![pics/creating_k8s_marketplace.png](pics/creating_k8s_marketplace.png) -# Installing Kubernetes using AKS-e (Skip the rest of the page if you did it using Portal) +# Installing Kubernetes using AKS-e ## Login to the desired cloud @@ -160,9 +160,28 @@ In our case we updated these fields: - "portalURL": "https://portal.demo2.stackpoc.com" - "dnsPrefix": "kube-rgDEMO2" - "keyData": "\" -- updated the `"orchestratorReleaseVersion"` from 1.15 to 1.15.5(which is among the listed supported versions) +- updated the `"orchestratorReleaseVersion"` to one of the listed supported versions +- changed the master count from 3 to 1, and the agent pool count to 4 +- added "apiServerConfig" values to resolve istio-system token storage. -Let's also change the master count from 3 to 1. Here is the resulting `kube-rgDEMO2_demoe2.json`: +***Note that `apiServerConfig` may not be available from the template.*** Please make sure you have this definition in `"kubernetesConfig"`: ``` + "properties": { + ... + "orchestratorProfile": { + ... + "kubernetesConfig": { + ... + "apiServerConfig": { + "--service-account-api-audiences": "api,istio-ca", + "--service-account-issuer": "kubernetes.default.svc", + "--service-account-signing-key-file": "/etc/kubernetes/certs/apiserver.key" + } + ... + ... +``` + +Here is the resulting `kube-rgDEMO2_demoe2.json`: { "apiVersion": "vlabs", @@ -170,17 +189,17 @@ Let's also change the master count from 3 to 1.
Here is the resulting `kube-rgDE "properties": { "orchestratorProfile": { "orchestratorType": "Kubernetes", - "orchestratorRelease": "1.15", + "orchestratorRelease": "1.17", + "orchestratorVersion": "1.17.11", "kubernetesConfig": { "cloudProviderBackoff": true, "cloudProviderBackoffRetries": 1, "cloudProviderBackoffDuration": 30, "cloudProviderRateLimit": true, - "cloudProviderRateLimitQPS": 3, - "cloudProviderRateLimitBucket": 10, - "cloudProviderRateLimitQPSWrite": 3, - "cloudProviderRateLimitBucketWrite": 10, - "kubernetesImageBase": "mcr.microsoft.com/k8s/azurestack/core/", + "cloudProviderRateLimitQPS": 100, + "cloudProviderRateLimitBucket": 150, + "cloudProviderRateLimitQPSWrite": 25, + "cloudProviderRateLimitBucketWrite": 30, "useInstanceMetadata": false, "networkPlugin": "kubenet", "kubeletConfig": { @@ -190,6 +209,11 @@ Let's also change the master count from 3 to 1. Here is the resulting `kube-rgDE "--node-monitor-grace-period": "5m", "--pod-eviction-timeout": "5m", "--route-reconciliation-period": "1m" + }, + "apiServerConfig": { + "--service-account-api-audiences": "api,istio-ca", + "--service-account-issuer": "kubernetes.default.svc", + "--service-account-signing-key-file": "/etc/kubernetes/certs/apiserver.key" } } }, @@ -209,7 +233,7 @@ Let's also change the master count from 3 to 1. 
Here is the resulting `kube-rgDE "agentPoolProfiles": [ { "name": "linuxpool", - "count": 3, + "count": 4, "vmSize": "Standard_F16", "distro": "aks-ubuntu-16.04", "availabilityProfile": "AvailabilitySet", @@ -241,11 +265,11 @@ see details in a separate page, [Installing aks-engine](installing_aks-engine.md Download `aks-engine` installation script: $ curl -o get-akse.sh https://raw.githubusercontent.com/Azure/aks-engine/master/scripts/get-akse.sh - $ chmod 700 get-akse.sh + $ chmod 755 get-akse.sh Run the installer, specifying its version: - $ ./get-akse.sh --version v0.43.0 + $ ./get-akse.sh --version v0.55.4 If you have problems, please refer to the official page: [Install the AKS engine on Linux in Azure Stack](https://docs.microsoft.com/en-us/azure-stack/user/azure-stack-kubernetes-aks-engine-deploy-linux). @@ -257,7 +281,7 @@ does have the connection, and uncompress it on the machine where you plan using Verify `aks-engine` version: $ aks-engine version - Version: v0.43.0 + Version: v0.55.4 GitCommit: 8928a4094 GitTreeState: clean @@ -344,7 +368,8 @@ environment. For this demo we will substitute `azurefile` with our own locally-mounted network storage. -Follow the steps in [Installing Storage](../01-Jupyter/installing_storage.md) to create a Persistent Volume Claim +Follow the steps in [Installing Storage](../01-Jupyter/installing_storage.md) +to create a Persistent Volume Claim that you could use in your Kubernetes deployments. For simplicity, we create a Samba server, but you are welcome to use nfs diff --git a/Research/kubeflow-on-azure-stack-lab/01-Jupyter/Readme.md b/Research/kubeflow-on-azure-stack-lab/01-Jupyter/Readme.md index d03bbd3f0..aa85008da 100644 --- a/Research/kubeflow-on-azure-stack-lab/01-Jupyter/Readme.md +++ b/Research/kubeflow-on-azure-stack-lab/01-Jupyter/Readme.md @@ -27,11 +27,14 @@ ML/NN and AI more broadly, are mathematical concepts and could be implemented in and frameworks. 
In this lab we will use mostly Python, but you are free to pick whatever you are comfortable with - many of the deployment options are language-agnostic as long as apis are satisfied. -## Tensorboard access +## (Optional) Tensorboard access There is another useful tool to monitor some ML applications if they support it. We provided a sample file to start it in your Kubernetes cluster, `tensorboard.yaml`. +**Pre-requisite**: You need a persistent volume. Follow the steps in [Installing Storage](installing_storage.md) to create a Persistent Volume Claim +that you could use in your Kubernetes deployments. + To start Tensorboard running, deploy it using `kubectl`, and theck that the pod is up: $ kubectl create -f tensorboard.yaml @@ -56,7 +59,7 @@ Now you can access the port you forward from your Kubernetes environment: ## Tensorboard deployment -Here is how you would connect your Tensorboard with the persistence we discuss next: +Here is how you would connect your Tensorboard with the persistence: $ cat tb.yaml apiVersion: extensions/v1beta1 diff --git a/Research/kubeflow-on-azure-stack-lab/01-Jupyter/installing_storage.md b/Research/kubeflow-on-azure-stack-lab/01-Jupyter/installing_storage.md index b5b35e844..a48c29b55 100644 --- a/Research/kubeflow-on-azure-stack-lab/01-Jupyter/installing_storage.md +++ b/Research/kubeflow-on-azure-stack-lab/01-Jupyter/installing_storage.md @@ -20,6 +20,8 @@ to other available options on the cluster you are using. ## Creating smb clients Each node of our Kubernetes cluster has to have Samba client to access our Samba server. +Make sure network ports 137-139 and 445 are accessible on all nodes of your cluster.
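Reachability of the Samba ports can be probed with bash's `/dev/tcp` redirection. This is a sketch; `NODE_IP` below is a placeholder, so substitute each node's local address:

```shell
# report open/closed for each SMB/NetBIOS port on one node
NODE_IP=127.0.0.1   # placeholder; use the node's local IP
for p in 137 138 139 445; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$NODE_IP/$p" 2>/dev/null; then
    echo "port $p reachable"
  else
    echo "port $p closed or filtered"
  fi
done
```

Note this only tests TCP connectivity; NetBIOS name service (137) is UDP in practice, so treat that result as a rough indicator.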
+ + You need to repeat the following on every vm in your Kubernetes cluster(you can get their local ip from the portal and ssh from the master node): diff --git a/Research/kubeflow-on-azure-stack-lab/01-Jupyter/working_with_tensorboard.md b/Research/kubeflow-on-azure-stack-lab/01-Jupyter/working_with_tensorboard.md index edc02b417..a7ee29280 100644 --- a/Research/kubeflow-on-azure-stack-lab/01-Jupyter/working_with_tensorboard.md +++ b/Research/kubeflow-on-azure-stack-lab/01-Jupyter/working_with_tensorboard.md @@ -4,6 +4,9 @@ Tensorboard is an application that helps visualizing data. It was built to visua TensorFlow, but could be used more broadly. For example, in our tutorial we demo how to use it for TensorFlow and Pytorch. +**Pre-requisite**: You need a persistent volume. Follow the steps in [Installing Storage](installing_storage.md) to create a Persistent Volume Claim +that you could use in your Kubernetes deployments. + We could use a generic Tensorboard deplolyment, see `tb_generic.yaml`: $ kubectl create -f tb_generic.yaml diff --git a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/Readme.md b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/Readme.md index 6a311b870..12ec5da4a 100644 --- a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/Readme.md +++ b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/Readme.md @@ -216,7 +216,7 @@ Your updated .yaml should look something like: spec: containers: - name: tensorflow - image: kubeflow/tf-dist-mnist-test:1.0 + image: rollingstone/tf-dist-mnist-test:1.0 volumeMounts: - mountPath: "/tmp/mnist-data" name: samba-share-volume2 @@ -231,7 +231,7 @@ Your updated .yaml should look something like: spec: containers: - name: tensorflow - image: kubeflow/tf-dist-mnist-test:1.0 + image: rollingstone/tf-dist-mnist-test:1.0 volumeMounts: - mountPath: "/tmp/mnist-data" name: samba-share-volume2 @@ -258,7 +258,8 @@ For more tutorials and How-Tos, see TensorFlow's [save_and_load.ipynb](https://g ## Tensorboard -There is another useful tool
to monitor some ML applications if they support it. We provided a sample file to start it in your Kubernetes cluster, `tensorboard.yaml`. +There is another useful tool to monitor some ML applications if they support it. We provided a sample file to start it in your Kubernetes cluster, `tensorboard.yaml`. For this exercise we have a separate one. +**Delete the old tensorboard instance if it is already running.** A concrete example of a tensorboard-using script is in folder `tf-mnist-w-tb`. You will need your github account to build the image(substitute `rollingstone` for yours) and run: diff --git a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test-with_persistence.yaml b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test-with_persistence.yaml index 5e9503375..ded74ace0 100644 --- a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test-with_persistence.yaml +++ b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test-with_persistence.yaml @@ -11,7 +11,7 @@ spec: spec: containers: - name: tensorflow - image: kubeflow/tf-dist-mnist-test:1.0 + image: rollingstone/tf-dist-mnist-test:1.0 volumeMounts: - mountPath: "/tmp/mnist-data" name: samba-share-volume2 @@ -26,7 +26,7 @@ spec: spec: containers: - name: tensorflow - image: kubeflow/tf-dist-mnist-test:1.0 + image: rollingstone/tf-dist-mnist-test:1.0 volumeMounts: - mountPath: "/tmp/mnist-data" name: samba-share-volume2 diff --git a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test.yaml b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test.yaml index 3a1df3357..e2a17e203 100644 --- a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test.yaml +++ b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/dist-mnist-e2e-test/tf_job_mnist-e2e-test.yaml @@ -11,7 +11,7 @@ spec: 
spec: containers: - name: tensorflow - image: kubeflow/tf-dist-mnist-test:1.0 + image: rollingstone/tf-dist-mnist-test:1.0 Worker: replicas: 3 restartPolicy: OnFailure @@ -19,4 +19,4 @@ spec: spec: containers: - name: tensorflow - image: kubeflow/tf-dist-mnist-test:1.0 + image: rollingstone/tf-dist-mnist-test:1.0 diff --git a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/mnist-w-tb/tb_tf.yaml b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/mnist-w-tb/tb_tf.yaml index 3b7d54026..4e417e4df 100644 --- a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/mnist-w-tb/tb_tf.yaml +++ b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/mnist-w-tb/tb_tf.yaml @@ -1,39 +1,54 @@ -apiVersion: extensions/v1beta1 +apiVersion: apps/v1 kind: Deployment metadata: + creationTimestamp: null labels: app: tensorboard name: tensorboard spec: + progressDeadlineSeconds: 2147483647 replicas: 1 + revisionHistoryLimit: 2147483647 selector: matchLabels: app: tensorboard + strategy: + rollingUpdate: + maxSurge: 1 + maxUnavailable: 1 + type: RollingUpdate template: metadata: + creationTimestamp: null labels: app: tensorboard spec: - volumes: - - name: samba-share-volume2 - persistentVolumeClaim: - claimName: samba-share-claim containers: - - name: tensorboard - image: tensorflow/tensorflow:1.10.0 - imagePullPolicy: Always - command: - - /usr/local/bin/tensorboard - args: + - args: - --logdir - /tmp/tensorflow/logs - volumeMounts: - - mountPath: /tmp/tensorflow - subPath: tf-mnist-w-tb - name: samba-share-volume2 + command: + - /usr/local/bin/tensorboard + image: tensorflow/tensorflow:1.10.0 + imagePullPolicy: Always + name: tensorboard ports: - containerPort: 6006 protocol: TCP + resources: {} + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + volumeMounts: + - mountPath: /tmp/tensorflow + name: samba-share-volume2 + subPath: tf-mnist-w-tb dnsPolicy: ClusterFirst restartPolicy: Always - \ No newline at end of file + schedulerName: default-scheduler + securityContext: {} 
+ terminationGracePeriodSeconds: 30 + volumes: + - name: samba-share-volume2 + persistentVolumeClaim: + claimName: samba-share-claim +status: {} diff --git a/Research/kubeflow-on-azure-stack-lab/02-TFJobs/mnist-w-tb/tb_tf_v1beta1.yaml b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/mnist-w-tb/tb_tf_v1beta1.yaml new file mode 100644 index 000000000..3b7d54026 --- /dev/null +++ b/Research/kubeflow-on-azure-stack-lab/02-TFJobs/mnist-w-tb/tb_tf_v1beta1.yaml @@ -0,0 +1,39 @@ +apiVersion: extensions/v1beta1 +kind: Deployment +metadata: + labels: + app: tensorboard + name: tensorboard +spec: + replicas: 1 + selector: + matchLabels: + app: tensorboard + template: + metadata: + labels: + app: tensorboard + spec: + volumes: + - name: samba-share-volume2 + persistentVolumeClaim: + claimName: samba-share-claim + containers: + - name: tensorboard + image: tensorflow/tensorflow:1.10.0 + imagePullPolicy: Always + command: + - /usr/local/bin/tensorboard + args: + - --logdir + - /tmp/tensorflow/logs + volumeMounts: + - mountPath: /tmp/tensorflow + subPath: tf-mnist-w-tb + name: samba-share-volume2 + ports: + - containerPort: 6006 + protocol: TCP + dnsPolicy: ClusterFirst + restartPolicy: Always + \ No newline at end of file diff --git a/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo-with_persistence.yaml b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo-with_persistence.yaml index ea964eb0f..ca7c4b627 100644 --- a/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo-with_persistence.yaml +++ b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo-with_persistence.yaml @@ -11,7 +11,7 @@ spec: spec: containers: - name: pytorch - image: kubeflow/pytorch-dist-mnist-test:1.0 + image: rollingstone/pytorch-dist-mnist-test:1.0 args: ["--backend", "gloo"] # Comment out the below resources to use the CPU. 
resources: @@ -31,7 +31,7 @@ spec: spec: containers: - name: pytorch - image: kubeflow/pytorch-dist-mnist-test:1.0 + image: rollingstone/pytorch-dist-mnist-test:1.0 args: ["--backend", "gloo"] # Comment out the below resources to use the CPU. resources: diff --git a/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo.yaml b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo.yaml index 7e6fef814..8c9862688 100644 --- a/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo.yaml +++ b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/pytorch-dist-mnist-gloo-demo.yaml @@ -11,7 +11,7 @@ spec: spec: containers: - name: pytorch - image: kubeflow/pytorch-dist-mnist-test:1.0 + image: rollingstone/pytorch-dist-mnist-test:1.0 args: ["--backend", "gloo"] # Comment out the below resources to use the CPU. resources: @@ -24,7 +24,7 @@ spec: spec: containers: - name: pytorch - image: kubeflow/pytorch-dist-mnist-test:1.0 + image: rollingstone/pytorch-dist-mnist-test:1.0 args: ["--backend", "gloo"] # Comment out the below resources to use the CPU. 
resources: diff --git a/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/tb_pytorch.yaml b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/tb_pytorch.yaml index 55c76db56..588eee38e 100644 --- a/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/tb_pytorch.yaml +++ b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/tb_pytorch.yaml @@ -1,38 +1,53 @@ -apiVersion: extensions/v1beta1 +apiVersion: apps/v1 kind: Deployment metadata: + creationTimestamp: null labels: app: tensorboard name: tensorboard spec: + progressDeadlineSeconds: 2147483647 replicas: 1 + revisionHistoryLimit: 2147483647 selector: matchLabels: app: tensorboard + strategy: + rollingUpdate: + maxSurge: 1 + maxUnavailable: 1 + type: RollingUpdate template: metadata: + creationTimestamp: null labels: app: tensorboard spec: - volumes: - - name: samba-share-volume2 - persistentVolumeClaim: - claimName: samba-share-claim containers: - - name: tensorboard - image: tensorflow/tensorflow:1.10.0 - imagePullPolicy: Always - command: - - /usr/local/bin/tensorboard - args: + - args: - --logdir - /tmp/tensorflow/logs - volumeMounts: - - mountPath: /tmp/tensorflow - # subPath: pytorch-tb - name: samba-share-volume2 + command: + - /usr/local/bin/tensorboard + image: tensorflow/tensorflow:1.10.0 + imagePullPolicy: Always + name: tensorboard ports: - containerPort: 6006 protocol: TCP + resources: {} + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + volumeMounts: + - mountPath: /tmp/tensorflow + name: samba-share-volume2 dnsPolicy: ClusterFirst - restartPolicy: Always \ No newline at end of file + restartPolicy: Always + schedulerName: default-scheduler + securityContext: {} + terminationGracePeriodSeconds: 30 + volumes: + - name: samba-share-volume2 + persistentVolumeClaim: + claimName: samba-share-claim +status: {} diff --git a/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/tb_pytorch_v1beta1.yaml 
b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/tb_pytorch_v1beta1.yaml new file mode 100644 index 000000000..55c76db56 --- /dev/null +++ b/Research/kubeflow-on-azure-stack-lab/03-PyTorchJobs/tb_pytorch_v1beta1.yaml @@ -0,0 +1,38 @@ +apiVersion: extensions/v1beta1 +kind: Deployment +metadata: + labels: + app: tensorboard + name: tensorboard +spec: + replicas: 1 + selector: + matchLabels: + app: tensorboard + template: + metadata: + labels: + app: tensorboard + spec: + volumes: + - name: samba-share-volume2 + persistentVolumeClaim: + claimName: samba-share-claim + containers: + - name: tensorboard + image: tensorflow/tensorflow:1.10.0 + imagePullPolicy: Always + command: + - /usr/local/bin/tensorboard + args: + - --logdir + - /tmp/tensorflow/logs + volumeMounts: + - mountPath: /tmp/tensorflow + # subPath: pytorch-tb + name: samba-share-volume2 + ports: + - containerPort: 6006 + protocol: TCP + dnsPolicy: ClusterFirst + restartPolicy: Always \ No newline at end of file diff --git a/Research/kubeflow-on-azure-stack-lab/sbin/kubeflow_install.sh b/Research/kubeflow-on-azure-stack-lab/sbin/kubeflow_install.sh index 19a9c7165..59704bf2a 100644 --- a/Research/kubeflow-on-azure-stack-lab/sbin/kubeflow_install.sh +++ b/Research/kubeflow-on-azure-stack-lab/sbin/kubeflow_install.sh @@ -10,16 +10,15 @@ export KF_CTL_DIR=~/kubeflow/ export KF_NAME=sandboxASkf export KF_USERNAME=azureuser -##export KFCTL_RELEASE_FILENAME=kfctl_v1.1.0-0-g9a3621e_linux.tar.gz -##export KFCTL_RELEASE_URI="https://github.com/kubeflow/kfctl/releases/download/v1.1.0/${KFCTL_RELEASE_FILENAME}" -export KFCTL_RELEASE_FILENAME=kfctl_v1.0.2-0-ga476281_linux.tar.gz -export KFCTL_RELEASE_URI="https://github.com/kubeflow/kfctl/releases/download/v1.0.2/${KFCTL_RELEASE_FILENAME}" +export KFCTL_RELEASE_FILENAME=kfctl_v1.1.0-0-g9a3621e_linux.tar.gz +export KFCTL_RELEASE_URI="https://github.com/kubeflow/kfctl/releases/download/v1.1.0/${KFCTL_RELEASE_FILENAME}" +#export 
KFCTL_RELEASE_FILENAME=kfctl_v1.0.2-0-ga476281_linux.tar.gz +#export KFCTL_RELEASE_URI="https://github.com/kubeflow/kfctl/releases/download/v1.0.2/${KFCTL_RELEASE_FILENAME}" export KF_DIR_BASE=/opt -##export KF_CONFIG_FILENAME="kfctl_k8s_istio.v1.1.0.yaml" -##export KF_CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.1-branch/kfdef/${KF_CONFIG_FILENAME}" -export KF_CONFIG_FILENAME="kfctl_k8s_istio.v1.0.2.yaml" +export KF_CONFIG_FILENAME="kfctl_k8s_istio.v1.1.0.yaml" +#export KF_CONFIG_FILENAME="kfctl_k8s_istio.v1.0.2.yaml" export KF_CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.1-branch/kfdef/${KF_CONFIG_FILENAME}" export DO_UNATTENDED=0 diff --git a/Research/kubeflow-on-azure-stack-lab/sbin/tensorboard.yaml b/Research/kubeflow-on-azure-stack-lab/sbin/tensorboard.yaml index de3795173..588eee38e 100644 --- a/Research/kubeflow-on-azure-stack-lab/sbin/tensorboard.yaml +++ b/Research/kubeflow-on-azure-stack-lab/sbin/tensorboard.yaml @@ -1,37 +1,53 @@ -apiVersion: extensions/v1beta1 +apiVersion: apps/v1 kind: Deployment metadata: + creationTimestamp: null labels: app: tensorboard name: tensorboard spec: + progressDeadlineSeconds: 2147483647 replicas: 1 + revisionHistoryLimit: 2147483647 selector: matchLabels: app: tensorboard + strategy: + rollingUpdate: + maxSurge: 1 + maxUnavailable: 1 + type: RollingUpdate template: metadata: + creationTimestamp: null labels: app: tensorboard spec: - volumes: - - name: samba-share-volume2 - persistentVolumeClaim: - claimName: samba-share-claim containers: - - name: tensorboard - image: tensorflow/tensorflow:1.10.0 - imagePullPolicy: Always - command: - - /usr/local/bin/tensorboard - args: + - args: - --logdir - /tmp/tensorflow/logs - volumeMounts: - - mountPath: /tmp/tensorflow - name: samba-share-volume2 + command: + - /usr/local/bin/tensorboard + image: tensorflow/tensorflow:1.10.0 + imagePullPolicy: Always + name: tensorboard ports: - containerPort: 6006 protocol: TCP + resources: {} + 
terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + volumeMounts: + - mountPath: /tmp/tensorflow + name: samba-share-volume2 dnsPolicy: ClusterFirst - restartPolicy: Always \ No newline at end of file + restartPolicy: Always + schedulerName: default-scheduler + securityContext: {} + terminationGracePeriodSeconds: 30 + volumes: + - name: samba-share-volume2 + persistentVolumeClaim: + claimName: samba-share-claim +status: {} diff --git a/Research/kubeflow-on-azure-stack-lab/sbin/tensorboard_v1beta1.yaml b/Research/kubeflow-on-azure-stack-lab/sbin/tensorboard_v1beta1.yaml new file mode 100644 index 000000000..4650bec1d --- /dev/null +++ b/Research/kubeflow-on-azure-stack-lab/sbin/tensorboard_v1beta1.yaml @@ -0,0 +1,38 @@ +apiVersion: extensions/v1beta1 +kind: Deployment +metadata: + labels: + app: tensorboard + name: tensorboard +spec: + replicas: 1 + selector: + matchLabels: + app: tensorboard + template: + metadata: + labels: + app: tensorboard + spec: + volumes: + - name: samba-share-volume2 + persistentVolumeClaim: + claimName: samba-share-claim + containers: + - name: tensorboard + image: tensorflow/tensorflow:1.10.0 + imagePullPolicy: Always + command: + - /usr/local/bin/tensorboard + args: + - --logdir + - /tmp/tensorflow/logs + volumeMounts: + - mountPath: /tmp/tensorflow + name: samba-share-volume2 + ports: + - containerPort: 6006 + protocol: TCP + dnsPolicy: ClusterFirst + restartPolicy: Always + \ No newline at end of file diff --git a/machine-learning-notebooks/deploying-model-on-k8s/Readme.md b/machine-learning-notebooks/deploying-model-on-k8s/Readme.md index df830f687..dad7d95f1 100644 --- a/machine-learning-notebooks/deploying-model-on-k8s/Readme.md +++ b/machine-learning-notebooks/deploying-model-on-k8s/Readme.md @@ -275,8 +275,6 @@ We provide the Deployment file, `deploy_infer.yaml`: - containerPort: 8888 resources: limits: - memory: "128Mi" #128 MB - cpu: "200m" # 200 millicpu (0.2 or 20% of the cpu) nvidia.com/gpu: 1 
imagePullSecrets: - name: secret4acr2infer diff --git a/machine-learning-notebooks/deploying-model-on-k8s/deploy_infer.yaml b/machine-learning-notebooks/deploying-model-on-k8s/deploy_infer.yaml index c9b05b5c5..c54aeff63 100644 --- a/machine-learning-notebooks/deploying-model-on-k8s/deploy_infer.yaml +++ b/machine-learning-notebooks/deploying-model-on-k8s/deploy_infer.yaml @@ -30,8 +30,12 @@ spec: - containerPort: 8888 resources: limits: - memory: "128Mi" #128 MB - cpu: "200m" # 200 millicpu (0.2 or 20% of the cpu) + # if you know your models minimal requirements, you can control + # the resource usage here. Some models may not work unless they + # have enough. + # + # memory: "128Mi" #128 MB + # cpu: "200m" # 200 millicpu (0.2 or 20% of the cpu) nvidia.com/gpu: 1 imagePullSecrets: - name: secret4acr2infer
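If you later decide to re-enable the commented-out limits, a reasonable starting point is to set `requests` below `limits` and size both from your model's observed footprint. The numbers below are illustrative only, not measured values:

```yaml
resources:
  requests:           # what the scheduler reserves for the pod
    memory: "512Mi"
    cpu: "500m"
  limits:             # hard caps; exceeding memory gets the pod OOM-killed
    memory: "2Gi"
    cpu: "1"
    nvidia.com/gpu: 1 # GPUs are specified in limits (requests, if given, must match)
```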