diff --git a/configuration.ipynb b/configuration.ipynb
index 5a8de20ff..ae08c05ed 100644
--- a/configuration.ipynb
+++ b/configuration.ipynb
@@ -96,7 +96,7 @@
    "source": [
     "import azureml.core\n",
     "\n",
-    "print(\"This notebook was created using version 1.0.18 of the Azure ML SDK\")\n",
+    "print(\"This notebook was created using version 1.0.21 of the Azure ML SDK\")\n",
     "print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
   ]
  },
@@ -336,7 +336,7 @@
    "\n",
    "In this notebook you configured this notebook library to connect easily to an Azure ML workspace. You can copy this notebook to your own libraries to connect them to your workspace, or use it to bootstrap new workspaces completely.\n",
    "\n",
-    "If you came here from another notebook, you can return there and complete that exercise, or you can try out the [Tutorials](./tutorials) or jump into \"how-to\" notebooks and start creating and deploying models. A good place to start is the [train in notebook](./how-to-use-azureml/training/train-in-notebook) example that walks through a simplified but complete end to end machine learning process."
+    "If you came here from another notebook, you can return there and complete that exercise, or you can try out the [Tutorials](./tutorials) or jump into \"how-to\" notebooks and start creating and deploying models. A good place to start is the [train within notebook](./how-to-use-azureml/training/train-within-notebook) example that walks through a simplified but complete end-to-end machine learning process."
   ]
  },
  {
diff --git a/how-to-use-azureml/automated-machine-learning/README.md b/how-to-use-azureml/automated-machine-learning/README.md
index 048fa4bed..0d44cbc7c 100644
--- a/how-to-use-azureml/automated-machine-learning/README.md
+++ b/how-to-use-azureml/automated-machine-learning/README.md
@@ -42,21 +42,7 @@ Below are the three execution environments supported by AutoML.
 ## Running samples in a Local Conda environment
 To run these notebooks on your own notebook server, use these installation instructions.
-
-The instructions below will install everything you need and then start a Jupyter notebook. To start your Jupyter notebook manually, use:
-
-```
-conda activate azure_automl
-jupyter notebook
-```
-
-or on Mac:
-
-```
-source activate azure_automl
-jupyter notebook
-```
-
+The instructions below will install everything you need and then start a Jupyter notebook.
 ### 1. Install mini-conda from [here](https://conda.io/miniconda.html), choose 64-bit Python 3.7 or higher.
 - **Note**: if you already have conda installed, you can keep using it but it should be version 4.4.10 or later (as shown by: conda -V). If you have a previous version installed, you can update it using the command: conda update conda.
@@ -97,6 +83,21 @@ bash automl_setup_linux.sh
 - Please make sure you use the Python [conda env:azure_automl] kernel when trying the sample Notebooks.
 - Follow the instructions in the individual notebooks to explore various features in AutoML
+### 6. 
Starting jupyter notebook manually +To start your Jupyter notebook manually, use: + +``` +conda activate azure_automl +jupyter notebook +``` + +or on Mac or Linux: + +``` +source activate azure_automl +jupyter notebook +``` + # Automated ML SDK Sample Notebooks diff --git a/how-to-use-azureml/automated-machine-learning/automl_env.yml b/how-to-use-azureml/automated-machine-learning/automl_env.yml deleted file mode 100644 index a3314efde..000000000 --- a/how-to-use-azureml/automated-machine-learning/automl_env.yml +++ /dev/null @@ -1,22 +0,0 @@ -name: azure_automl -dependencies: - # The python interpreter version. - # Currently Azure ML only supports 3.5.2 and later. -- python>=3.5.2,<3.6.8 -- nb_conda -- matplotlib==2.1.0 -- numpy>=1.11.0,<1.15.0 -- cython -- urllib3<1.24 -- scipy>=1.0.0,<=1.1.0 -- scikit-learn>=0.18.0,<=0.19.1 -- pandas>=0.22.0,<0.23.0 -- tensorflow>=1.12.0 -- py-xgboost<=0.80 - -- pip: - # Required packages for AzureML execution, history, and data preparation. - - azureml-sdk[automl,explain] - - azureml-widgets - - pandas_ml - diff --git a/how-to-use-azureml/automated-machine-learning/automl_env_mac.yml b/how-to-use-azureml/automated-machine-learning/automl_env_mac.yml deleted file mode 100644 index 432499bf0..000000000 --- a/how-to-use-azureml/automated-machine-learning/automl_env_mac.yml +++ /dev/null @@ -1,23 +0,0 @@ -name: azure_automl -dependencies: - # The python interpreter version. - # Currently Azure ML only supports 3.5.2 and later. -- python>=3.5.2,<3.6.8 -- nb_conda -- matplotlib==2.1.0 -- numpy>=1.15.3 -- cython -- urllib3<1.24 -- scipy>=1.0.0,<=1.1.0 -- scikit-learn>=0.18.0,<=0.19.1 -- pandas>=0.22.0,<0.23.0 -- tensorflow>=1.12.0 -- py-xgboost<=0.80 - -- pip: - # Required packages for AzureML execution, history, and data preparation. - - azureml-sdk[automl,explain] - - azureml-widgets - - pandas_ml - - diff --git a/how-to-use-azureml/automated-machine-learning/automl_setup.cmd b/how-to-use-azureml/automated-machine-learning/automl_setup.cmd deleted file mode 100644 index 2ef804201..000000000 --- a/how-to-use-azureml/automated-machine-learning/automl_setup.cmd +++ /dev/null @@ -1,51 +0,0 @@ -@echo off -set conda_env_name=%1 -set automl_env_file=%2 -set options=%3 -set PIP_NO_WARN_SCRIPT_LOCATION=0 - -IF "%conda_env_name%"=="" SET conda_env_name="azure_automl" -IF "%automl_env_file%"=="" SET automl_env_file="automl_env.yml" - -IF NOT EXIST %automl_env_file% GOTO YmlMissing - -call conda activate %conda_env_name% 2>nul: - -if not errorlevel 1 ( - echo Upgrading azureml-sdk[automl,notebooks,explain] in existing conda environment %conda_env_name% - call pip install --upgrade azureml-sdk[automl,notebooks,explain] - if errorlevel 1 goto ErrorExit -) else ( - call conda env create -f %automl_env_file% -n %conda_env_name% -) - -call conda activate %conda_env_name% 2>nul: -if errorlevel 1 goto ErrorExit - -call python -m ipykernel install --user --name %conda_env_name% --display-name "Python (%conda_env_name%)" - -REM azureml.widgets is now installed as part of the pip install under the conda env. -REM Removing the old user install so that the notebooks will use the latest widget. -call jupyter nbextension uninstall --user --py azureml.widgets - -echo. -echo. -echo *************************************** -echo * AutoML setup completed successfully * -echo *************************************** -IF NOT "%options%"=="nolaunch" ( - echo. - echo Starting jupyter notebook - please run the configuration notebook - echo. 
- jupyter notebook --log-level=50 --notebook-dir='..\..' -) - -goto End - -:YmlMissing -echo File %automl_env_file% not found. - -:ErrorExit -echo Install failed - -:End \ No newline at end of file diff --git a/how-to-use-azureml/automated-machine-learning/automl_setup_linux.sh b/how-to-use-azureml/automated-machine-learning/automl_setup_linux.sh deleted file mode 100644 index db8a357c6..000000000 --- a/how-to-use-azureml/automated-machine-learning/automl_setup_linux.sh +++ /dev/null @@ -1,52 +0,0 @@ -#!/bin/bash - -CONDA_ENV_NAME=$1 -AUTOML_ENV_FILE=$2 -OPTIONS=$3 -PIP_NO_WARN_SCRIPT_LOCATION=0 - -if [ "$CONDA_ENV_NAME" == "" ] -then - CONDA_ENV_NAME="azure_automl" -fi - -if [ "$AUTOML_ENV_FILE" == "" ] -then - AUTOML_ENV_FILE="automl_env.yml" -fi - -if [ ! -f $AUTOML_ENV_FILE ]; then - echo "File $AUTOML_ENV_FILE not found" - exit 1 -fi - -if source activate $CONDA_ENV_NAME 2> /dev/null -then - echo "Upgrading azureml-sdk[automl,notebooks,explain] in existing conda environment" $CONDA_ENV_NAME - pip install --upgrade azureml-sdk[automl,notebooks,explain] && - jupyter nbextension uninstall --user --py azureml.widgets -else - conda env create -f $AUTOML_ENV_FILE -n $CONDA_ENV_NAME && - source activate $CONDA_ENV_NAME && - python -m ipykernel install --user --name $CONDA_ENV_NAME --display-name "Python ($CONDA_ENV_NAME)" && - jupyter nbextension uninstall --user --py azureml.widgets && - echo "" && - echo "" && - echo "***************************************" && - echo "* AutoML setup completed successfully *" && - echo "***************************************" && - if [ "$OPTIONS" != "nolaunch" ] - then - echo "" && - echo "Starting jupyter notebook - please run the configuration notebook" && - echo "" && - jupyter notebook --log-level=50 --notebook-dir '../..' - fi -fi - -if [ $? -gt 0 ] -then - echo "Installation failed" -fi - - diff --git a/how-to-use-azureml/automated-machine-learning/automl_setup_mac.sh b/how-to-use-azureml/automated-machine-learning/automl_setup_mac.sh deleted file mode 100644 index 84a45e155..000000000 --- a/how-to-use-azureml/automated-machine-learning/automl_setup_mac.sh +++ /dev/null @@ -1,55 +0,0 @@ -#!/bin/bash - -CONDA_ENV_NAME=$1 -AUTOML_ENV_FILE=$2 -OPTIONS=$3 -PIP_NO_WARN_SCRIPT_LOCATION=0 - -if [ "$CONDA_ENV_NAME" == "" ] -then - CONDA_ENV_NAME="azure_automl" -fi - -if [ "$AUTOML_ENV_FILE" == "" ] -then - AUTOML_ENV_FILE="automl_env_mac.yml" -fi - -if [ ! -f $AUTOML_ENV_FILE ]; then - echo "File $AUTOML_ENV_FILE not found" - exit 1 -fi - -if source activate $CONDA_ENV_NAME 2> /dev/null -then - echo "Upgrading azureml-sdk[automl,notebooks,explain] in existing conda environment" $CONDA_ENV_NAME - pip install --upgrade azureml-sdk[automl,notebooks,explain] && - jupyter nbextension uninstall --user --py azureml.widgets -else - conda env create -f $AUTOML_ENV_FILE -n $CONDA_ENV_NAME && - source activate $CONDA_ENV_NAME && - conda install lightgbm -c conda-forge -y && - python -m ipykernel install --user --name $CONDA_ENV_NAME --display-name "Python ($CONDA_ENV_NAME)" && - jupyter nbextension uninstall --user --py azureml.widgets && - pip install numpy==1.15.3 && - echo "" && - echo "" && - echo "***************************************" && - echo "* AutoML setup completed successfully *" && - echo "***************************************" && - if [ "$OPTIONS" != "nolaunch" ] - then - echo "" && - echo "Starting jupyter notebook - please run the configuration notebook" && - echo "" && - jupyter notebook --log-level=50 --notebook-dir '../..' - fi -fi - -if [ $? 
-gt 0 ] -then - echo "Installation failed" -fi - - - diff --git a/how-to-use-azureml/automated-machine-learning/classification-with-deployment/auto-ml-classification-with-deployment.ipynb b/how-to-use-azureml/automated-machine-learning/classification-with-deployment/auto-ml-classification-with-deployment.ipynb index 8937d8f7c..f6093b481 100644 --- a/how-to-use-azureml/automated-machine-learning/classification-with-deployment/auto-ml-classification-with-deployment.ipynb +++ b/how-to-use-azureml/automated-machine-learning/classification-with-deployment/auto-ml-classification-with-deployment.ipynb @@ -119,7 +119,7 @@ "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n", "|**n_cross_validations**|Number of cross validation splits.|\n", "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", - "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]
Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|\n",
+    "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n",
     "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|"
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/classification-with-whitelisting/auto-ml-classification-with-whitelisting.ipynb b/how-to-use-azureml/automated-machine-learning/classification-with-whitelisting/auto-ml-classification-with-whitelisting.ipynb
index 27f399a39..1b20301a2 100644
--- a/how-to-use-azureml/automated-machine-learning/classification-with-whitelisting/auto-ml-classification-with-whitelisting.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/classification-with-whitelisting/auto-ml-classification-with-whitelisting.ipynb
@@ -60,6 +60,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Note: This notebook will install TensorFlow if it is not already installed in the environment.\n",
     "import logging\n",
     "\n",
     "from matplotlib import pyplot as plt\n",
@@ -70,6 +71,11 @@
    "import azureml.core\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.core.workspace import Workspace\n",
+    "try:\n",
+    "    import tensorflow as tf1\n",
+    "except ImportError:\n",
+    "    from pip._internal import main\n",
+    "    main(['install', 'tensorflow>=1.10.0,<=1.12.0'])\n",
     "from azureml.train.automl import AutoMLConfig"
   ]
  },
@@ -138,7 +144,7 @@
    "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
    "|**n_cross_validations**|Number of cross validation splits.|\n",
    "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
-    "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>
Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|\n", + "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|\n", "|**whitelist_models**|List of models that AutoML should use. The possible values are listed [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#configure-your-experiment-settings).|" ] diff --git a/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb b/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb index 3594af18e..03d9c8eb5 100644 --- a/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb +++ b/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb @@ -1,443 +1,443 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Copyright (c) Microsoft Corporation. All rights reserved.\n", - "\n", - "Licensed under the MIT License." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Automated Machine Learning\n", - "_**Classification with Local Compute**_\n", - "\n", - "## Contents\n", - "1. [Introduction](#Introduction)\n", - "1. [Setup](#Setup)\n", - "1. [Data](#Data)\n", - "1. [Train](#Train)\n", - "1. [Results](#Results)\n", - "1. [Test](#Test)\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for a simple classification problem.\n", - "\n", - "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n", - "\n", - "In this notebook you will learn how to:\n", - "1. Create an `Experiment` in an existing `Workspace`.\n", - "2. Configure AutoML using `AutoMLConfig`.\n", - "3. Train the model using local compute.\n", - "4. Explore the results.\n", - "5. Test the best fitted model." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup\n", - "\n", - "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "\n", - "from matplotlib import pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "from sklearn import datasets\n", - "\n", - "import azureml.core\n", - "from azureml.core.experiment import Experiment\n", - "from azureml.core.workspace import Workspace\n", - "from azureml.train.automl import AutoMLConfig" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ws = Workspace.from_config()\n", - "\n", - "# Choose a name for the experiment and specify the project folder.\n", - "experiment_name = 'automl-classification'\n", - "project_folder = './sample_projects/automl-classification'\n", - "\n", - "experiment = Experiment(ws, experiment_name)\n", - "\n", - "output = {}\n", - "output['SDK version'] = azureml.core.VERSION\n", - "output['Subscription ID'] = ws.subscription_id\n", - "output['Workspace Name'] = ws.name\n", - "output['Resource Group'] = ws.resource_group\n", - "output['Location'] = ws.location\n", - "output['Project Directory'] = project_folder\n", - "output['Experiment Name'] = experiment.name\n", - "pd.set_option('display.max_colwidth', -1)\n", - "outputDf = pd.DataFrame(data = output, index = [''])\n", - "outputDf.T" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data\n", - "\n", - "This uses scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) method." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "digits = datasets.load_digits()\n", - "\n", - "# Exclude the first 100 rows from training so that they can be used for test.\n", - "X_train = digits.data[100:,:]\n", - "y_train = digits.target[100:]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train\n", - "\n", - "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n", - "\n", - "|Property|Description|\n", - "|-|-|\n", - "|**task**|classification or regression|\n", - "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics:
accuracy
AUC_weighted
average_precision_score_weighted
norm_macro_recall
precision_score_weighted|\n", - "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n", - "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n", - "|**n_cross_validations**|Number of cross validation splits.|\n", - "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", - "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", - "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "automl_config = AutoMLConfig(task = 'classification',\n", - " debug_log = 'automl_errors.log',\n", - " primary_metric = 'AUC_weighted',\n", - " iteration_timeout_minutes = 60,\n", - " iterations = 25,\n", - " n_cross_validations = 3,\n", - " verbosity = logging.INFO,\n", - " X = X_train, \n", - " y = y_train,\n", - " path = project_folder)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n", - "In this example, we specify `show_output = True` to print currently running iterations to the console." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "local_run = experiment.submit(automl_config, show_output = True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "local_run" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "local_run = local_run.continue_experiment(X = X_train, \n", - " y = y_train, \n", - " show_output = True,\n", - " iterations = 5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Results" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Widget for Monitoring Runs\n", - "\n", - "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n", - "\n", - "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.widgets import RunDetails\n", - "RunDetails(local_run).show() " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "#### Retrieve All Child Runs\n", - "You can also use SDK methods to fetch all the child runs and see individual metrics that we log." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "children = list(local_run.get_children())\n", - "metricslist = {}\n", - "for run in children:\n", - " properties = run.get_properties()\n", - " metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", - " metricslist[int(properties['iteration'])] = metrics\n", - "\n", - "rundata = pd.DataFrame(metricslist).sort_index(1)\n", - "rundata" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Retrieve the Best Model\n", - "\n", - "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "best_run, fitted_model = local_run.get_output()\n", - "print(best_run)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Print the properties of the model\n", - "The fitted_model is a python object and you can read the different properties of the object.\n", - "The following shows printing hyperparameters for each step in the pipeline." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pprint import pprint\n", - "\n", - "def print_model(model, prefix=\"\"):\n", - " for step in model.steps:\n", - " print(prefix + step[0])\n", - " if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):\n", - " pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})\n", - " print()\n", - " for estimator in step[1].estimators:\n", - " print_model(estimator[1], estimator[0]+ ' - ')\n", - " else:\n", - " pprint(step[1].get_params())\n", - " print()\n", - " \n", - "print_model(fitted_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Best Model Based on Any Other Metric\n", - "Show the run and the model that has the smallest `log_loss` value:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "lookup_metric = \"log_loss\"\n", - "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n", - "print(best_run)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print_model(fitted_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Model from a Specific Iteration\n", - "Show the run and the model from the third iteration:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "iteration = 3\n", - "third_run, third_model = local_run.get_output(iteration = iteration)\n", - "print(third_run)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print_model(third_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Test \n", - "\n", - "#### Load Test Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "digits = datasets.load_digits()\n", - "X_test = digits.data[:10, :]\n", - "y_test = digits.target[:10]\n", - "images = digits.images[:10]" - ] 
- }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Testing Our Best Fitted Model\n", - "We will try to predict 2 digits and see how our model works." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Randomly select digits and test.\n", - "for index in np.random.choice(len(y_test), 2, replace = False):\n", - " print(index)\n", - " predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", - " label = y_test[index]\n", - " title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n", - " fig = plt.figure(1, figsize = (3,3))\n", - " ax1 = fig.add_axes((0,0,.8,.8))\n", - " ax1.set_title(title)\n", - " plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", - " plt.show()" - ] - } - ], - "metadata": { - "authors": [ - { - "name": "savitam" - } + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Automated Machine Learning\n", + "_**Classification with Local Compute**_\n", + "\n", + "## Contents\n", + "1. [Introduction](#Introduction)\n", + "1. [Setup](#Setup)\n", + "1. [Data](#Data)\n", + "1. [Train](#Train)\n", + "1. [Results](#Results)\n", + "1. [Test](#Test)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for a simple classification problem.\n", + "\n", + "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n", + "\n", + "In this notebook you will learn how to:\n", + "1. Create an `Experiment` in an existing `Workspace`.\n", + "2. Configure AutoML using `AutoMLConfig`.\n", + "3. Train the model using local compute.\n", + "4. Explore the results.\n", + "5. Test the best fitted model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." 
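+    "\n",
+    "The cell after the imports calls `Workspace.from_config()`, which reads the connection details saved to `config.json` by the [configuration](../../../configuration.ipynb) notebook. A minimal sketch, assuming that file exists in this directory or one of its parents:\n",
+    "\n",
+    "```python\n",
+    "from azureml.core.workspace import Workspace\n",
+    "\n",
+    "# from_config() searches the current directory and its parents for config.json.\n",
+    "ws = Workspace.from_config()\n",
+    "print(ws.name, ws.resource_group, ws.location)\n",
+    "```"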
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "\n", + "from matplotlib import pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "import azureml.core\n", + "from azureml.core.experiment import Experiment\n", + "from azureml.core.workspace import Workspace\n", + "from azureml.train.automl import AutoMLConfig" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ws = Workspace.from_config()\n", + "\n", + "# Choose a name for the experiment and specify the project folder.\n", + "experiment_name = 'automl-classification'\n", + "project_folder = './sample_projects/automl-classification'\n", + "\n", + "experiment = Experiment(ws, experiment_name)\n", + "\n", + "output = {}\n", + "output['SDK version'] = azureml.core.VERSION\n", + "output['Subscription ID'] = ws.subscription_id\n", + "output['Workspace Name'] = ws.name\n", + "output['Resource Group'] = ws.resource_group\n", + "output['Location'] = ws.location\n", + "output['Project Directory'] = project_folder\n", + "output['Experiment Name'] = experiment.name\n", + "pd.set_option('display.max_colwidth', -1)\n", + "outputDf = pd.DataFrame(data = output, index = [''])\n", + "outputDf.T" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data\n", + "\n", + "This uses scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "digits = datasets.load_digits()\n", + "\n", + "# Exclude the first 100 rows from training so that they can be used for test.\n", + "X_train = digits.data[100:,:]\n", + "y_train = digits.target[100:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train\n", + "\n", + "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n", + "\n", + "|Property|Description|\n", + "|-|-|\n", + "|**task**|classification or regression|\n", + "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics:
accuracy
AUC_weighted
average_precision_score_weighted
norm_macro_recall
precision_score_weighted|\n", + "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n", + "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n", + "|**n_cross_validations**|Number of cross validation splits.|\n", + "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", + "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", + "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "automl_config = AutoMLConfig(task = 'classification',\n", + " debug_log = 'automl_errors.log',\n", + " primary_metric = 'AUC_weighted',\n", + " iteration_timeout_minutes = 60,\n", + " iterations = 25,\n", + " n_cross_validations = 3,\n", + " verbosity = logging.INFO,\n", + " X = X_train, \n", + " y = y_train,\n", + " path = project_folder)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n", + "In this example, we specify `show_output = True` to print currently running iterations to the console." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_run = experiment.submit(automl_config, show_output = True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_run = local_run.continue_experiment(X = X_train, \n", + " y = y_train, \n", + " show_output = True,\n", + " iterations = 5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Widget for Monitoring Runs\n", + "\n", + "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n", + "\n", + "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.widgets import RunDetails\n", + "RunDetails(local_run).show() " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Retrieve All Child Runs\n", + "You can also use SDK methods to fetch all the child runs and see individual metrics that we log." 
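+    "\n",
+    "Once `rundata` is built in the next cell, you can, for example, look up which iteration scored best on the primary metric. A minimal sketch, assuming `AUC_weighted` was logged for the child runs:\n",
+    "\n",
+    "```python\n",
+    "# Columns of rundata are iteration numbers; rows are metric names.\n",
+    "best_iteration = rundata.loc['AUC_weighted'].idxmax()\n",
+    "print('Best iteration:', best_iteration)\n",
+    "```"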
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "children = list(local_run.get_children())\n", + "metricslist = {}\n", + "for run in children:\n", + " properties = run.get_properties()\n", + " metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", + " metricslist[int(properties['iteration'])] = metrics\n", + "\n", + "rundata = pd.DataFrame(metricslist).sort_index(1)\n", + "rundata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Retrieve the Best Model\n", + "\n", + "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "best_run, fitted_model = local_run.get_output()\n", + "print(best_run)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Print the properties of the model\n", + "The fitted_model is a python object and you can read the different properties of the object.\n", + "The following shows printing hyperparameters for each step in the pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pprint import pprint\n", + "\n", + "def print_model(model, prefix=\"\"):\n", + " for step in model.steps:\n", + " print(prefix + step[0])\n", + " if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):\n", + " pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})\n", + " print()\n", + " for estimator in step[1].estimators:\n", + " print_model(estimator[1], estimator[0]+ ' - ')\n", + " else:\n", + " pprint(step[1].get_params())\n", + " print()\n", + " \n", + "print_model(fitted_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Best Model Based on Any Other Metric\n", + "Show the run and the model that has the smallest `log_loss` value:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lookup_metric = \"log_loss\"\n", + "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n", + "print(best_run)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_model(fitted_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Model from a Specific Iteration\n", + "Show the run and the model from the third iteration:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "iteration = 3\n", + "third_run, third_model = local_run.get_output(iteration = iteration)\n", + "print(third_run)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_model(third_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test \n", + "\n", + "#### Load Test Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "digits = datasets.load_digits()\n", + "X_test = digits.data[:10, :]\n", + "y_test = digits.target[:10]\n", + "images = digits.images[:10]" + ] 
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Testing Our Best Fitted Model\n", + "We will try to predict 2 digits and see how our model works." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Randomly select digits and test.\n", + "for index in np.random.choice(len(y_test), 2, replace = False):\n", + " print(index)\n", + " predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", + " label = y_test[index]\n", + " title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n", + " fig = plt.figure(1, figsize = (3,3))\n", + " ax1 = fig.add_axes((0,0,.8,.8))\n", + " ax1.set_title(title)\n", + " plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", + " plt.show()" + ] + } ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} + "metadata": { + "authors": [ + { + "name": "savitam" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/automated-machine-learning/dataprep-remote-execution/auto-ml-dataprep-remote-execution.ipynb b/how-to-use-azureml/automated-machine-learning/dataprep-remote-execution/auto-ml-dataprep-remote-execution.ipynb index d28675094..304aed969 100644 --- a/how-to-use-azureml/automated-machine-learning/dataprep-remote-execution/auto-ml-dataprep-remote-execution.ipynb +++ b/how-to-use-azureml/automated-machine-learning/dataprep-remote-execution/auto-ml-dataprep-remote-execution.ipynb @@ -211,7 +211,7 @@ "\n", "conda_run_config.target = dsvm_compute\n", "\n", - "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n", + "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy','py-xgboost<=0.80'])\n", "conda_run_config.environment.python.conda_dependencies = cd" ] }, diff --git a/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb b/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb index 049872dfb..def4ef8e5 100644 --- a/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb +++ b/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb @@ -197,9 +197,9 @@ "|**iterations**|Number of iterations. In each iteration, Auto ML trains a specific pipeline on the given data|\n", "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n", "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", - "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]
Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers. |\n",
+    "|**y**|(sparse) array-like, shape = [n_samples, ], target values.|\n",
     "|**X_valid**|Data used to evaluate a model in an iteration. (sparse) array-like, shape = [n_samples, n_features]|\n",
-    "|**y_valid**|Data used to evaluate a model in a iteration. (sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers. |\n",
+    "|**y_valid**|Data used to evaluate a model in an iteration. (sparse) array-like, shape = [n_samples, ], target values.|\n",
     "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder. "
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/missing-data-blacklist-early-termination/auto-ml-missing-data-blacklist-early-termination.ipynb b/how-to-use-azureml/automated-machine-learning/missing-data-blacklist-early-termination/auto-ml-missing-data-blacklist-early-termination.ipynb
index e6802499b..83a6bfa8c 100644
--- a/how-to-use-azureml/automated-machine-learning/missing-data-blacklist-early-termination/auto-ml-missing-data-blacklist-early-termination.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/missing-data-blacklist-early-termination/auto-ml-missing-data-blacklist-early-termination.ipynb
@@ -159,7 +159,7 @@
    "|**experiment_exit_score**|*double* value indicating the target for *primary_metric*.<br>
Once the target is surpassed the run terminates.|\n", "|**blacklist_models**|*List* of *strings* indicating machine learning algorithms for AutoML to avoid in this run.

Allowed values for **Classification**
LogisticRegression
SGD
MultinomialNaiveBayes
BernoulliNaiveBayes
SVM
LinearSVM
KNN
DecisionTree
RandomForest
ExtremeRandomTrees
LightGBM
GradientBoosting
TensorFlowDNN
TensorFlowLinearClassifier

Allowed values for **Regression**
ElasticNet
GradientBoosting
DecisionTree
KNN
LassoLars
SGD
RandomForest
ExtremeRandomTrees
LightGBM
TensorFlowLinearRegressor
TensorFlowDNN|\n", "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", - "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]
Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|\n",
+    "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n",
     "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|"
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/model-explanation/auto-ml-model-explanation.ipynb b/how-to-use-azureml/automated-machine-learning/model-explanation/auto-ml-model-explanation.ipynb
index 8d9b5a124..ec1766747 100644
--- a/how-to-use-azureml/automated-machine-learning/model-explanation/auto-ml-model-explanation.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/model-explanation/auto-ml-model-explanation.ipynb
@@ -140,9 +140,9 @@
    "|**max_time_sec**|Time limit in minutes for each iteration|\n",
    "|**iterations**|Number of iterations. In each iteration Auto ML trains the data with a specific pipeline|\n",
    "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
-    "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers. |\n",
+    "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n",
     "|**X_valid**|(sparse) array-like, shape = [n_samples, n_features]|\n",
-    "|**y_valid**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]|\n",
+    "|**y_valid**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n",
     "|**model_explainability**|Indicates whether or not to explain each trained pipeline |\n",
     "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder. |"
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/regression/auto-ml-regression.ipynb b/how-to-use-azureml/automated-machine-learning/regression/auto-ml-regression.ipynb
index 74375c556..bb43ea3ba 100644
--- a/how-to-use-azureml/automated-machine-learning/regression/auto-ml-regression.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/regression/auto-ml-regression.ipynb
@@ -137,7 +137,7 @@
    "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
    "|**n_cross_validations**|Number of cross validation splits.|\n",
    "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
-    "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>
Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|\n",
+    "|**y**|(sparse) array-like, shape = [n_samples, ], target values.|\n",
     "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|"
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/remote-amlcompute/auto-ml-remote-amlcompute.ipynb b/how-to-use-azureml/automated-machine-learning/remote-amlcompute/auto-ml-remote-amlcompute.ipynb
index d781231de..38ffd41e4 100644
--- a/how-to-use-azureml/automated-machine-learning/remote-amlcompute/auto-ml-remote-amlcompute.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/remote-amlcompute/auto-ml-remote-amlcompute.ipynb
@@ -220,7 +220,7 @@
    "# set the data reference of the run configuration\n",
    "conda_run_config.data_references = {ds.name: dr}\n",
    "\n",
-    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n",
+    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy','py-xgboost<=0.80'])\n",
    "conda_run_config.environment.python.conda_dependencies = cd"
   ]
  },
@@ -267,7 +267,7 @@
    "outputs": [],
    "source": [
     "automl_settings = {\n",
-    "    \"iteration_timeout_minutes\": 2,\n",
+    "    \"iteration_timeout_minutes\": 10,\n",
     "    \"iterations\": 20,\n",
     "    \"n_cross_validations\": 5,\n",
     "    \"primary_metric\": 'AUC_weighted',\n",
diff --git a/how-to-use-azureml/automated-machine-learning/remote-attach/auto-ml-remote-attach.ipynb b/how-to-use-azureml/automated-machine-learning/remote-attach/auto-ml-remote-attach.ipynb
index 13ac6f374..c33d3892a 100644
--- a/how-to-use-azureml/automated-machine-learning/remote-attach/auto-ml-remote-attach.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/remote-attach/auto-ml-remote-attach.ipynb
@@ -167,7 +167,7 @@
    "# Set compute target to the Linux DSVM\n",
    "conda_run_config.target = dsvm_compute\n",
    "\n",
-    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n",
+    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy','py-xgboost<=0.80'])\n",
    "conda_run_config.environment.python.conda_dependencies = cd"
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/remote-execution-with-datastore/auto-ml-remote-execution-with-datastore.ipynb b/how-to-use-azureml/automated-machine-learning/remote-execution-with-datastore/auto-ml-remote-execution-with-datastore.ipynb
index 28225c2e2..aeac93c44 100644
--- a/how-to-use-azureml/automated-machine-learning/remote-execution-with-datastore/auto-ml-remote-execution-with-datastore.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/remote-execution-with-datastore/auto-ml-remote-execution-with-datastore.ipynb
@@ -254,7 +254,7 @@
    "# set the data reference of the run configuration\n",
    "conda_run_config.data_references = {ds.name: dr}\n",
    "\n",
-    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n",
+    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy','py-xgboost<=0.80'])\n",
    "conda_run_config.environment.python.conda_dependencies = cd"
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/remote-execution/auto-ml-remote-execution.ipynb b/how-to-use-azureml/automated-machine-learning/remote-execution/auto-ml-remote-execution.ipynb
index 241c7a952..637d9a3f6 100644
--- a/how-to-use-azureml/automated-machine-learning/remote-execution/auto-ml-remote-execution.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/remote-execution/auto-ml-remote-execution.ipynb
@@ -193,7 +193,7 @@
    "# set the data reference of the run configuration\n",
    "conda_run_config.data_references = {ds.name: dr}\n",
    "\n",
-    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n",
+    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy','py-xgboost<=0.80'])\n",
    "conda_run_config.environment.python.conda_dependencies = cd"
   ]
  },
diff --git a/how-to-use-azureml/automated-machine-learning/sparse-data-train-test-split/auto-ml-sparse-data-train-test-split.ipynb b/how-to-use-azureml/automated-machine-learning/sparse-data-train-test-split/auto-ml-sparse-data-train-test-split.ipynb
index 5bcaccff3..ac1d59fcd 100644
--- a/how-to-use-azureml/automated-machine-learning/sparse-data-train-test-split/auto-ml-sparse-data-train-test-split.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/sparse-data-train-test-split/auto-ml-sparse-data-train-test-split.ipynb
@@ -156,9 +156,9 @@
    "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
    "|**preprocess**|Setting this to *True* enables AutoML to perform preprocessing on the input to handle *missing data*, and to perform some common *feature extraction*.<br>
**Note:** If input data is sparse, you cannot use *True*.|\n", "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", - "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]
Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|\n", + "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", "|**X_valid**|(sparse) array-like, shape = [n_samples, n_features] for the custom validation set.|\n", - "|**y_valid**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]
Multi-class targets. An indicator matrix turns on multilabel classification for the custom validation set.|\n", + "|**y_valid**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|" ] }, diff --git a/how-to-use-azureml/azure-databricks/README.md b/how-to-use-azureml/azure-databricks/README.md index d0c135241..cef512746 100644 --- a/how-to-use-azureml/azure-databricks/README.md +++ b/how-to-use-azureml/azure-databricks/README.md @@ -22,7 +22,7 @@ Notebook 6 is an Automated ML sample notebook for Classification. Learn more about [how to use Azure Databricks as a development environment](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-environment#azure-databricks) for Azure Machine Learning service. **Databricks as a Compute Target from AML Pipelines** -You can use Azure Databricks as a compute target from [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines). Take a look at this notebook for details: [aml-pipelines-use-databricks-as-compute-target.ipynb](aml-pipelines-use-databricks-as-compute-target.ipynb). +You can use Azure Databricks as a compute target from [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines). Take a look at this notebook for details: [aml-pipelines-use-databricks-as-compute-target.ipynb](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/aml-pipelines-use-databricks-as-compute-target.ipynb). For more on SDK concepts, please refer to [notebooks](https://github.com/Azure/MachineLearningNotebooks). diff --git a/how-to-use-azureml/azure-databricks/aml-pipelines-use-databricks-as-compute-target.ipynb b/how-to-use-azureml/azure-databricks/aml-pipelines-use-databricks-as-compute-target.ipynb deleted file mode 100644 index 649c197a5..000000000 --- a/how-to-use-azureml/azure-databricks/aml-pipelines-use-databricks-as-compute-target.ipynb +++ /dev/null @@ -1,714 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Copyright (c) Microsoft Corporation. All rights reserved. \n", - "Licensed under the MIT License." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using Databricks as a Compute Target from Azure Machine Learning Pipeline\n", - "To use Databricks as a compute target from [Azure Machine Learning Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines), a [DatabricksStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py) is used. This notebook demonstrates the use of DatabricksStep in Azure Machine Learning Pipeline.\n", - "\n", - "The notebook will show:\n", - "1. Running an arbitrary Databricks notebook that the customer has in Databricks workspace\n", - "2. Running an arbitrary Python script that the customer has in DBFS\n", - "3. Running an arbitrary Python script that is available on local computer (will upload to DBFS, and then run in Databricks) \n", - "4. Running a JAR job that the customer has in DBFS.\n", - "\n", - "## Before you begin:\n", - "\n", - "1. 
**Create an Azure Databricks workspace** in the same subscription where you have your Azure Machine Learning workspace. You will need details of this workspace later on to define DatabricksStep. [Click here](https://ms.portal.azure.com/#blade/HubsExtension/Resources/resourceType/Microsoft.Databricks%2Fworkspaces) for more information.\n", - "2. **Create PAT (access token)**: Manually create a Databricks access token at the Azure Databricks portal. See [this](https://docs.databricks.com/api/latest/authentication.html#generate-a-token) for more information.\n", - "3. **Add demo notebook to ADB**: This notebook has a sample you can use as is. Launch Azure Databricks attached to your Azure Machine Learning workspace and add a new notebook. \n", - "4. **Create/attach a Blob storage** for use from ADB" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Add demo notebook to ADB Workspace\n", - "Copy and paste the below code to create a new notebook in your ADB workspace." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# direct access\n", - "dbutils.widgets.get(\"myparam\")\n", - "p = getArgument(\"myparam\")\n", - "print (\"Param -\\'myparam':\")\n", - "print (p)\n", - "\n", - "dbutils.widgets.get(\"input\")\n", - "i = getArgument(\"input\")\n", - "print (\"Param -\\'input':\")\n", - "print (i)\n", - "\n", - "dbutils.widgets.get(\"output\")\n", - "o = getArgument(\"output\")\n", - "print (\"Param -\\'output':\")\n", - "print (o)\n", - "\n", - "n = i + \"/testdata.txt\"\n", - "df = spark.read.csv(n)\n", - "\n", - "display (df)\n", - "\n", - "data = [('value1', 'value2')]\n", - "df2 = spark.createDataFrame(data)\n", - "\n", - "z = o + \"/output.txt\"\n", - "df2.write.csv(z)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Azure Machine Learning and Pipeline SDK-specific imports" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import azureml.core\n", - "from azureml.core.runconfig import JarLibrary\n", - "from azureml.core.compute import ComputeTarget, DatabricksCompute\n", - "from azureml.exceptions import ComputeTargetException\n", - "from azureml.core import Workspace, Experiment\n", - "from azureml.pipeline.core import Pipeline, PipelineData\n", - "from azureml.pipeline.steps import DatabricksStep\n", - "from azureml.core.datastore import Datastore\n", - "from azureml.data.data_reference import DataReference\n", - "\n", - "# Check core SDK version number\n", - "print(\"SDK version:\", azureml.core.VERSION)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initialize Workspace\n", - "\n", - "Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ws = Workspace.from_config()\n", - "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Attach Databricks compute target\n", - "Next, you need to add your Databricks workspace to Azure Machine Learning as a compute target and give it a name. 
You will use this name to refer to your Databricks workspace compute target inside Azure Machine Learning.\n", - "\n", - "- **Resource Group** - The resource group name of your Azure Machine Learning workspace\n", - "- **Databricks Workspace Name** - The workspace name of your Azure Databricks workspace\n", - "- **Databricks Access Token** - The access token you created in ADB\n", - "\n", - "**The Databricks workspace need to be present in the same subscription as your AML workspace**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Replace with your account info before running.\n", - " \n", - "db_compute_name=os.getenv(\"DATABRICKS_COMPUTE_NAME\", \"\") # Databricks compute name\n", - "db_resource_group=os.getenv(\"DATABRICKS_RESOURCE_GROUP\", \"\") # Databricks resource group\n", - "db_workspace_name=os.getenv(\"DATABRICKS_WORKSPACE_NAME\", \"\") # Databricks workspace name\n", - "db_access_token=os.getenv(\"DATABRICKS_ACCESS_TOKEN\", \"\") # Databricks access token\n", - " \n", - "try:\n", - " databricks_compute = DatabricksCompute(workspace=ws, name=db_compute_name)\n", - " print('Compute target {} already exists'.format(db_compute_name))\n", - "except ComputeTargetException:\n", - " print('Compute not found, will use below parameters to attach new one')\n", - " print('db_compute_name {}'.format(db_compute_name))\n", - " print('db_resource_group {}'.format(db_resource_group))\n", - " print('db_workspace_name {}'.format(db_workspace_name))\n", - " print('db_access_token {}'.format(db_access_token))\n", - " \n", - " config = DatabricksCompute.attach_configuration(\n", - " resource_group = db_resource_group,\n", - " workspace_name = db_workspace_name,\n", - " access_token= db_access_token)\n", - " databricks_compute=ComputeTarget.attach(ws, db_compute_name, config)\n", - " databricks_compute.wait_for_completion(True)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data Connections with Inputs and Outputs\n", - "The DatabricksStep supports Azure Bloband ADLS for inputs and outputs. You also will need to define a [Secrets](https://docs.azuredatabricks.net/user-guide/secrets/index.html) scope to enable authentication to external data sources such as Blob and ADLS from Databricks.\n", - "\n", - "- Databricks documentation on [Azure Blob](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html)\n", - "- Databricks documentation on [ADLS](https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html)\n", - "\n", - "### Type of Data Access\n", - "Databricks allows to interact with Azure Blob and ADLS in two ways.\n", - "- **Direct Access**: Databricks allows you to interact with Azure Blob or ADLS URIs directly. The input or output URIs will be mapped to a Databricks widget param in the Databricks notebook.\n", - "- **Mounting**: You will be supplied with additional parameters and secrets that will enable you to mount your ADLS or Azure Blob input or output location in your Databricks notebook." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Direct Access: Python sample code\n", - "If you have a data reference named \"input\" it will represent the URI of the input and you can access it directly in the Databricks python notebook like so:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "dbutils.widgets.get(\"input\")\n", - "y = getArgument(\"input\")\n", - "df = spark.read.csv(y)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Mounting: Python sample code for Azure Blob\n", - "Given an Azure Blob data reference named \"input\" the following widget params will be made available in the Databricks notebook:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# This contains the input URI\n", - "dbutils.widgets.get(\"input\")\n", - "myinput_uri = getArgument(\"input\")\n", - "\n", - "# How to get the input datastore name inside ADB notebook\n", - "# This contains the name of a Databricks secret (in the predefined \"amlscope\" secret scope) \n", - "# that contians an access key or sas for the Azure Blob input (this name is obtained by appending \n", - "# the name of the input with \"_blob_secretname\". \n", - "dbutils.widgets.get(\"input_blob_secretname\") \n", - "myinput_blob_secretname = getArgument(\"input_blob_secretname\")\n", - "\n", - "# This contains the required configuration for mounting\n", - "dbutils.widgets.get(\"input_blob_config\")\n", - "myinput_blob_config = getArgument(\"input_blob_config\")\n", - "\n", - "# Usage\n", - "dbutils.fs.mount(\n", - " source = myinput_uri,\n", - " mount_point = \"/mnt/input\",\n", - " extra_configs = {myinput_blob_config:dbutils.secrets.get(scope = \"amlscope\", key = myinput_blob_secretname)})\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Mounting: Python sample code for ADLS\n", - "Given an ADLS data reference named \"input\" the following widget params will be made available in the Databricks notebook:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# This contains the input URI\n", - "dbutils.widgets.get(\"input\") \n", - "myinput_uri = getArgument(\"input\")\n", - "\n", - "# This contains the client id for the service principal \n", - "# that has access to the adls input\n", - "dbutils.widgets.get(\"input_adls_clientid\") \n", - "myinput_adls_clientid = getArgument(\"input_adls_clientid\")\n", - "\n", - "# This contains the name of a Databricks secret (in the predefined \"amlscope\" secret scope) \n", - "# that contains the secret for the above mentioned service principal\n", - "dbutils.widgets.get(\"input_adls_secretname\") \n", - "myinput_adls_secretname = getArgument(\"input_adls_secretname\")\n", - "\n", - "# This contains the refresh url for the mounting configs\n", - "dbutils.widgets.get(\"input_adls_refresh_url\") \n", - "myinput_adls_refresh_url = getArgument(\"input_adls_refresh_url\")\n", - "\n", - "# Usage \n", - "configs = {\"dfs.adls.oauth2.access.token.provider.type\": \"ClientCredential\",\n", - " \"dfs.adls.oauth2.client.id\": myinput_adls_clientid,\n", - " \"dfs.adls.oauth2.credential\": dbutils.secrets.get(scope = \"amlscope\", key =myinput_adls_secretname),\n", - " \"dfs.adls.oauth2.refresh.url\": myinput_adls_refresh_url}\n", - "\n", - "dbutils.fs.mount(\n", - " source = myinput_uri,\n", - " mount_point = \"/mnt/output\",\n", - " extra_configs = configs)\n", - "```" - 
] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Use Databricks from Azure Machine Learning Pipeline\n", - "To use Databricks as a compute target from Azure Machine Learning Pipeline, a DatabricksStep is used. Let's define a datasource (via DataReference) and intermediate data (via PipelineData) to be used in DatabricksStep." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Use the default blob storage\n", - "def_blob_store = Datastore(ws, \"workspaceblobstore\")\n", - "print('Datastore {} will be used'.format(def_blob_store.name))\n", - "\n", - "# We are uploading a sample file in the local directory to be used as a datasource\n", - "def_blob_store.upload_files(files=[\"./testdata.txt\"], target_path=\"dbtest\", overwrite=False)\n", - "\n", - "step_1_input = DataReference(datastore=def_blob_store, path_on_datastore=\"dbtest\",\n", - " data_reference_name=\"input\")\n", - "\n", - "step_1_output = PipelineData(\"output\", datastore=def_blob_store)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Add a DatabricksStep\n", - "Adds a Databricks notebook as a step in a Pipeline.\n", - "- ***name:** Name of the Module\n", - "- **inputs:** List of input connections for data consumed by this step. Fetch this inside the notebook using dbutils.widgets.get(\"input\")\n", - "- **outputs:** List of output port definitions for outputs produced by this step. Fetch this inside the notebook using dbutils.widgets.get(\"output\")\n", - "- **existing_cluster_id:** Cluster ID of an existing Interactive cluster on the Databricks workspace. If you are providing this, do not provide any of the parameters below that are used to create a new cluster such as spark_version, node_type, etc.\n", - "- **spark_version:** Version of spark for the databricks run cluster. default value: 4.0.x-scala2.11\n", - "- **node_type:** Azure vm node types for the databricks run cluster. default value: Standard_D3_v2\n", - "- **num_workers:** Specifies a static number of workers for the databricks run cluster\n", - "- **min_workers:** Specifies a min number of workers to use for auto-scaling the databricks run cluster\n", - "- **max_workers:** Specifies a max number of workers to use for auto-scaling the databricks run cluster\n", - "- **spark_env_variables:** Spark environment variables for the databricks run cluster (dictionary of {str:str}). default value: {'PYSPARK_PYTHON': '/databricks/python3/bin/python3'}\n", - "- **notebook_path:** Path to the notebook in the databricks instance. If you are providing this, do not provide python script related paramaters or JAR related parameters.\n", - "- **notebook_params:** Parameters for the databricks notebook (dictionary of {str:str}). Fetch this inside the notebook using dbutils.widgets.get(\"myparam\")\n", - "- **python_script_path:** The path to the python script in the DBFS or S3. If you are providing this, do not provide python_script_name which is used for uploading script from local machine.\n", - "- **python_script_params:** Parameters for the python script (list of str)\n", - "- **main_class_name:** The name of the entry point in a JAR module. If you are providing this, do not provide any python script or notebook related parameters.\n", - "- **jar_params:** Parameters for the JAR module (list of str)\n", - "- **python_script_name:** name of a python script on your local machine (relative to source_directory). 
If you are providing this do not provide python_script_path which is used to execute a remote python script; or any of the JAR or notebook related parameters.\n", - "- **source_directory:** folder that contains the script and other files\n", - "- **hash_paths:** list of paths to hash to detect a change in source_directory (script file is always hashed)\n", - "- **run_name:** Name in databricks for this run\n", - "- **timeout_seconds:** Timeout for the databricks run\n", - "- **runconfig:** Runconfig to use. Either pass runconfig or each library type as a separate parameter but do not mix the two\n", - "- **maven_libraries:** maven libraries for the databricks run\n", - "- **pypi_libraries:** pypi libraries for the databricks run\n", - "- **egg_libraries:** egg libraries for the databricks run\n", - "- **jar_libraries:** jar libraries for the databricks run\n", - "- **rcran_libraries:** rcran libraries for the databricks run\n", - "- **compute_target:** Azure Databricks compute\n", - "- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs\n", - "- **version:** Optional version tag to denote a change in functionality for the step\n", - "\n", - "\\* *denotes required fields* \n", - "*You must provide exactly one of num_workers or min_workers and max_workers paramaters* \n", - "*You must provide exactly one of databricks_compute or databricks_compute_name parameters*\n", - "\n", - "## Use runconfig to specify library dependencies\n", - "You can use a runconfig to specify the library dependencies for your cluster in Databricks. The runconfig will contain a databricks section as follows:\n", - "\n", - "```yaml\n", - "environment:\n", - "# Databricks details\n", - " databricks:\n", - "# List of maven libraries.\n", - " mavenLibraries:\n", - " - coordinates: org.jsoup:jsoup:1.7.1\n", - " repo: ''\n", - " exclusions:\n", - " - slf4j:slf4j\n", - " - '*:hadoop-client'\n", - "# List of PyPi libraries\n", - " pypiLibraries:\n", - " - package: beautifulsoup4\n", - " repo: ''\n", - "# List of RCran libraries\n", - " rcranLibraries:\n", - " -\n", - "# Coordinates.\n", - " package: ada\n", - "# Repo\n", - " repo: http://cran.us.r-project.org\n", - "# List of JAR libraries\n", - " jarLibraries:\n", - " -\n", - "# Coordinates.\n", - " library: dbfs:/mnt/libraries/library.jar\n", - "# List of Egg libraries\n", - " eggLibraries:\n", - " -\n", - "# Coordinates.\n", - " library: dbfs:/mnt/libraries/library.egg\n", - "```\n", - "\n", - "You can then create a RunConfiguration object using this file and pass it as the runconfig parameter to DatabricksStep.\n", - "```python\n", - "from azureml.core.runconfig import RunConfiguration\n", - "\n", - "runconfig = RunConfiguration()\n", - "runconfig.load(path='', name='')\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Running the demo notebook already added to the Databricks workspace\n", - "Create a notebook in the Azure Databricks workspace, and provide the path to that notebook as the value associated with the environment variable \"DATABRICKS_NOTEBOOK_PATH\". 
This will then set the variable notebook_path when you run the code cell below:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "notebook_path=os.getenv(\"DATABRICKS_NOTEBOOK_PATH\", \"\") # Databricks notebook path\n", - "\n", - "dbNbStep = DatabricksStep(\n", - " name=\"DBNotebookInWS\",\n", - " inputs=[step_1_input],\n", - " outputs=[step_1_output],\n", - " num_workers=1,\n", - " notebook_path=notebook_path,\n", - " notebook_params={'myparam': 'testparam'},\n", - " run_name='DB_Notebook_demo',\n", - " compute_target=databricks_compute,\n", - " allow_reuse=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Build and submit the Experiment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#PUBLISHONLY\n", - "#steps = [dbNbStep]\n", - "#pipeline = Pipeline(workspace=ws, steps=steps)\n", - "#pipeline_run = Experiment(ws, 'DB_Notebook_demo').submit(pipeline)\n", - "#pipeline_run.wait_for_completion()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### View Run Details" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#PUBLISHONLY\n", - "#from azureml.widgets import RunDetails\n", - "#RunDetails(pipeline_run).show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Running a Python script from DBFS\n", - "This shows how to run a Python script in DBFS. \n", - "\n", - "To complete this, you will need to first upload the Python script in your local machine to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html). The CLI command is given below:\n", - "\n", - "```\n", - "dbfs cp ./train-db-dbfs.py dbfs:/train-db-dbfs.py\n", - "```\n", - "\n", - "The code in the below cell assumes that you have completed the previous step of uploading the script `train-db-dbfs.py` to the root folder in DBFS." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "python_script_path = os.getenv(\"DATABRICKS_PYTHON_SCRIPT_PATH\", \"\") # Databricks python script path\n", - "\n", - "dbPythonInDbfsStep = DatabricksStep(\n", - " name=\"DBPythonInDBFS\",\n", - " inputs=[step_1_input],\n", - " num_workers=1,\n", - " python_script_path=python_script_path,\n", - " python_script_params={'--input_data'},\n", - " run_name='DB_Python_demo',\n", - " compute_target=databricks_compute,\n", - " allow_reuse=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Build and submit the Experiment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#PUBLISHONLY\n", - "#steps = [dbPythonInDbfsStep]\n", - "#pipeline = Pipeline(workspace=ws, steps=steps)\n", - "#pipeline_run = Experiment(ws, 'DB_Python_demo').submit(pipeline)\n", - "#pipeline_run.wait_for_completion()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### View Run Details" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#PUBLISHONLY\n", - "#from azureml.widgets import RunDetails\n", - "#RunDetails(pipeline_run).show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. 
Running a Python script in Databricks that currenlty is in local computer\n", - "To run a Python script that is currently in your local computer, follow the instructions below. \n", - "\n", - "The commented out code below code assumes that you have `train-db-local.py` in the `scripts` subdirectory under the current working directory.\n", - "\n", - "In this case, the Python script will be uploaded first to DBFS, and then the script will be run in Databricks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "python_script_name = \"train-db-local.py\"\n", - "source_directory = \".\"\n", - "\n", - "dbPythonInLocalMachineStep = DatabricksStep(\n", - " name=\"DBPythonInLocalMachine\",\n", - " inputs=[step_1_input],\n", - " num_workers=1,\n", - " python_script_name=python_script_name,\n", - " source_directory=source_directory,\n", - " run_name='DB_Python_Local_demo',\n", - " compute_target=databricks_compute,\n", - " allow_reuse=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Build and submit the Experiment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "steps = [dbPythonInLocalMachineStep]\n", - "pipeline = Pipeline(workspace=ws, steps=steps)\n", - "pipeline_run = Experiment(ws, 'DB_Python_Local_demo').submit(pipeline)\n", - "pipeline_run.wait_for_completion()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### View Run Details" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.widgets import RunDetails\n", - "RunDetails(pipeline_run).show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4. Running a JAR job that is alreay added in DBFS\n", - "To run a JAR job that is already uploaded to DBFS, follow the instructions below. You will first upload the JAR file to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html).\n", - "\n", - "The commented out code in the below cell assumes that you have uploaded `train-db-dbfs.jar` to the root folder in DBFS. 
You can upload `train-db-dbfs.jar` to the root folder in DBFS using this commandline so you can use `jar_library_dbfs_path = \"dbfs:/train-db-dbfs.jar\"`:\n", - "\n", - "```\n", - "dbfs cp ./train-db-dbfs.jar dbfs:/train-db-dbfs.jar\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "main_jar_class_name = \"com.microsoft.aeva.Main\"\n", - "jar_library_dbfs_path = os.getenv(\"DATABRICKS_JAR_LIB_PATH\", \"\") # Databricks jar library path\n", - "\n", - "dbJarInDbfsStep = DatabricksStep(\n", - " name=\"DBJarInDBFS\",\n", - " inputs=[step_1_input],\n", - " num_workers=1,\n", - " main_class_name=main_jar_class_name,\n", - " jar_params={'arg1', 'arg2'},\n", - " run_name='DB_JAR_demo',\n", - " jar_libraries=[JarLibrary(jar_library_dbfs_path)],\n", - " compute_target=databricks_compute,\n", - " allow_reuse=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Build and submit the Experiment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#PUBLISHONLY\n", - "#steps = [dbJarInDbfsStep]\n", - "#pipeline = Pipeline(workspace=ws, steps=steps)\n", - "#pipeline_run = Experiment(ws, 'DB_JAR_demo').submit(pipeline)\n", - "#pipeline_run.wait_for_completion()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### View Run Details" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#PUBLISHONLY\n", - "#from azureml.widgets import RunDetails\n", - "#RunDetails(pipeline_run).show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Next: ADLA as a Compute Target\n", - "To use ADLA as a compute target from Azure Machine Learning Pipeline, a AdlaStep is used. This [notebook](./aml-pipelines-use-adla-as-compute-target.ipynb) demonstrates the use of AdlaStep in Azure Machine Learning Pipeline." 
- ] - } - ], - "metadata": { - "authors": [ - { - "name": "diray" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.2" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-01.ipynb b/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-01.ipynb index 379cb8bde..3ad340e5b 100644 --- a/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-01.ipynb +++ b/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-01.ipynb @@ -271,11 +271,14 @@ "from azureml.core import Datastore\n", "\n", "datastore_name = 'demo_training'\n", + "container_name = 'digits' \n", + "account_name = 'automlpublicdatasets'\n", "Datastore.register_azure_blob_container(\n", " workspace = ws, \n", " datastore_name = datastore_name, \n", - " container_name = 'automl-notebook-data', \n", - " account_name = 'dprepdata'\n", + " container_name = container_name, \n", + " account_name = account_name,\n", + " overwrite = True\n", ")" ] }, @@ -340,10 +343,10 @@ "import azureml.dataprep as dprep\n", "from azureml.data.datapath import DataPath\n", "\n", - "datastore = Datastore.get(workspace = ws, name = datastore_name)\n", + "datastore = Datastore.get(workspace = ws, datastore_name = datastore_name)\n", "\n", - "X_train = dprep.read_csv(DataPath(datastore = datastore, path_on_datastore = 'X.csv')) \n", - "y_train = dprep.read_csv(DataPath(datastore = datastore, path_on_datastore = 'y.csv')).to_long(dprep.ColumnSelector(term='.*', use_regex = True))" + "X_train = dprep.read_csv(datastore.path('X.csv'))\n", + "y_train = dprep.read_csv(datastore.path('y.csv')).to_long(dprep.ColumnSelector(term='.*', use_regex = True))" ] }, { @@ -407,7 +410,7 @@ " debug_log = 'automl_errors.log',\n", " primary_metric = 'AUC_weighted',\n", " iteration_timeout_minutes = 10,\n", - " iterations = 5,\n", + " iterations = 3,\n", " preprocess = True,\n", " n_cross_validations = 10,\n", " max_concurrent_iterations = 2, #change it based on number of worker nodes\n", @@ -433,7 +436,27 @@ "metadata": {}, "outputs": [], "source": [ - "local_run = experiment.submit(automl_config, show_output = False) # for higher runs please use show_output=False and use the below" + "local_run = experiment.submit(automl_config, show_output = True)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Continue experiment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_run.continue_experiment(iterations=2,\n", + " X=X_train, \n", + " y=y_train,\n", + " spark_context=sc,\n", + " show_output=True)" ] }, { @@ -548,11 +571,11 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn import datasets\n", - "digits = datasets.load_digits()\n", - "X_test = digits.data[:10, :]\n", - "y_test = digits.target[:10]\n", - "images = digits.images[:10]" + "blob_location = \"https://{}.blob.core.windows.net/{}\".format(account_name, container_name)\n", + "X_test = pd.read_csv(\"{}/X_valid.csv\".format(blob_location), header=0)\n", + "y_test = pd.read_csv(\"{}/y_valid.csv\".format(blob_location), header=0)\n", + "images = pd.read_csv(\"{}/images.csv\".format(blob_location), 
header=None)\n", + "images = np.reshape(images.values, (100,8,8))" ] }, { @@ -573,9 +596,9 @@ "for index in np.random.choice(len(y_test), 2, replace = False):\n", " print(index)\n", " predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", - " label = y_test[index]\n", + " label = y_test.values[index]\n", " title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n", - " fig = plt.figure(1, figsize = (3,3))\n", + " fig = plt.figure(3, figsize = (5,5))\n", " ax1 = fig.add_axes((0,0,.8,.8))\n", " ax1.set_title(title)\n", " plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", @@ -605,7 +628,7 @@ "name": "savitam" }, { - "name": "wamartin" + "name": "sasum" } ], "kernelspec": { diff --git a/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-with-deployment.ipynb b/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-with-deployment.ipynb index 2254f111e..0ca2ddce8 100644 --- a/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-with-deployment.ipynb +++ b/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-with-deployment.ipynb @@ -288,11 +288,14 @@ "from azureml.core import Datastore\n", "\n", "datastore_name = 'demo_training'\n", + "container_name = 'digits' \n", + "account_name = 'automlpublicdatasets'\n", "Datastore.register_azure_blob_container(\n", " workspace = ws, \n", " datastore_name = datastore_name, \n", - " container_name = 'automl-notebook-data', \n", - " account_name = 'dprepdata'\n", + " container_name = container_name, \n", + " account_name = account_name,\n", + " overwrite = True\n", ")" ] }, @@ -357,10 +360,10 @@ "import azureml.dataprep as dprep\n", "from azureml.data.datapath import DataPath\n", "\n", - "datastore = Datastore.get(workspace = ws, name = datastore_name)\n", + "datastore = Datastore.get(workspace = ws, datastore_name = datastore_name)\n", "\n", - "X_train = dprep.read_csv(DataPath(datastore = datastore, path_on_datastore = 'X.csv')) \n", - "y_train = dprep.read_csv(DataPath(datastore = datastore, path_on_datastore = 'y.csv')).to_long(dprep.ColumnSelector(term='.*', use_regex = True))" + "X_train = dprep.read_csv(datastore.path('X.csv'))\n", + "y_train = dprep.read_csv(datastore.path('y.csv')).to_long(dprep.ColumnSelector(term='.*', use_regex = True))" ] }, { @@ -450,7 +453,7 @@ "metadata": {}, "outputs": [], "source": [ - "local_run = experiment.submit(automl_config, show_output = False) # for higher runs please use show_output=False and use the below" + "local_run = experiment.submit(automl_config, show_output = True)" ] }, { @@ -720,11 +723,11 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn import datasets\n", - "digits = datasets.load_digits()\n", - "X_test = digits.data[:10, :]\n", - "y_test = digits.target[:10]\n", - "images = digits.images[:10]" + "blob_location = \"https://{}.blob.core.windows.net/{}\".format(account_name, container_name)\n", + "X_test = pd.read_csv(\"{}/X_valid.csv\".format(blob_location), header=0)\n", + "y_test = pd.read_csv(\"{}/y_valid.csv\".format(blob_location), header=0)\n", + "images = pd.read_csv(\"{}/images.csv\".format(blob_location), header=None)\n", + "images = np.reshape(images.values, (100,8,8))" ] }, { @@ -745,9 +748,9 @@ "for index in np.random.choice(len(y_test), 2, replace = False):\n", " print(index)\n", " predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", - " label = y_test[index]\n", + " label = y_test.values[index]\n", " title = \"Label value = %d Predicted value = %d \" % 
(label, predicted)\n", - " fig = plt.figure(1, figsize = (3,3))\n", + " fig = plt.figure(3, figsize = (5,5))\n", " ax1 = fig.add_axes((0,0,.8,.8))\n", " ax1.set_title(title)\n", " plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", @@ -761,7 +764,7 @@ "name": "savitam" }, { - "name": "wamartin" + "name": "sasum" } ], "kernelspec": { diff --git a/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/README.md b/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/README.md new file mode 100644 index 000000000..440463bb6 --- /dev/null +++ b/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/README.md @@ -0,0 +1,16 @@ +# Using Databricks as a Compute Target from Azure Machine Learning Pipeline +To use Databricks as a compute target from Azure Machine Learning Pipeline, a DatabricksStep is used. This notebook demonstrates the use of DatabricksStep in Azure Machine Learning Pipeline. + +The notebook will show: + +1. Running an arbitrary Databricks notebook that the customer has in Databricks workspace +2. Running an arbitrary Python script that the customer has in DBFS +3. Running an arbitrary Python script that is available on local computer (will upload to DBFS, and then run in Databricks) +4. Running a JAR job that the customer has in DBFS. + +## Before you begin: +1. **Create an Azure Databricks workspace** in the same subscription where you have your Azure Machine Learning workspace. +You will need details of this workspace later on to define DatabricksStep. [More information](https://ms.portal.azure.com/#blade/HubsExtension/Resources/resourceType/Microsoft.Databricks%2Fworkspaces). +2. **Create PAT (access token)** at the Azure Databricks portal. [More information](https://docs.databricks.com/api/latest/authentication.html#generate-a-token). +3. **Add demo notebook to ADB** This notebook has a sample you can use as is. Launch Azure Databricks attached to your Azure Machine Learning workspace and add a new notebook. +4. **Create/attach a Blob storage** for use from ADB \ No newline at end of file diff --git a/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/aml-pipelines-use-databricks-as-compute-target.ipynb b/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/aml-pipelines-use-databricks-as-compute-target.ipynb new file mode 100644 index 000000000..4b92d2111 --- /dev/null +++ b/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/aml-pipelines-use-databricks-as-compute-target.ipynb @@ -0,0 +1,708 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved. \n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using Databricks as a Compute Target from Azure Machine Learning Pipeline\n", + "To use Databricks as a compute target from [Azure Machine Learning Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines), a [DatabricksStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py) is used. This notebook demonstrates the use of DatabricksStep in Azure Machine Learning Pipeline.\n", + "\n", + "The notebook will show:\n", + "1. Running an arbitrary Databricks notebook that the customer has in Databricks workspace\n", + "2. 
Running an arbitrary Python script that the customer has in DBFS\n", + "3. Running an arbitrary Python script that is available on local computer (will upload to DBFS, and then run in Databricks) \n", + "4. Running a JAR job that the customer has in DBFS.\n", + "\n", + "## Before you begin:\n", + "\n", + "1. **Create an Azure Databricks workspace** in the same subscription where you have your Azure Machine Learning workspace. You will need details of this workspace later on to define DatabricksStep. [Click here](https://ms.portal.azure.com/#blade/HubsExtension/Resources/resourceType/Microsoft.Databricks%2Fworkspaces) for more information.\n", + "2. **Create PAT (access token)**: Manually create a Databricks access token at the Azure Databricks portal. See [this](https://docs.databricks.com/api/latest/authentication.html#generate-a-token) for more information.\n", + "3. **Add demo notebook to ADB**: This notebook has a sample you can use as is. Launch Azure Databricks attached to your Azure Machine Learning workspace and add a new notebook. \n", + "4. **Create/attach a Blob storage** for use from ADB" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Add demo notebook to ADB Workspace\n", + "Copy and paste the below code to create a new notebook in your ADB workspace." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```python\n", + "# direct access\n", + "dbutils.widgets.get(\"myparam\")\n", + "p = getArgument(\"myparam\")\n", + "print (\"Param -\\'myparam':\")\n", + "print (p)\n", + "\n", + "dbutils.widgets.get(\"input\")\n", + "i = getArgument(\"input\")\n", + "print (\"Param -\\'input':\")\n", + "print (i)\n", + "\n", + "dbutils.widgets.get(\"output\")\n", + "o = getArgument(\"output\")\n", + "print (\"Param -\\'output':\")\n", + "print (o)\n", + "\n", + "n = i + \"/testdata.txt\"\n", + "df = spark.read.csv(n)\n", + "\n", + "display (df)\n", + "\n", + "data = [('value1', 'value2')]\n", + "df2 = spark.createDataFrame(data)\n", + "\n", + "z = o + \"/output.txt\"\n", + "df2.write.csv(z)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Azure Machine Learning and Pipeline SDK-specific imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import azureml.core\n", + "from azureml.core.runconfig import JarLibrary\n", + "from azureml.core.compute import ComputeTarget, DatabricksCompute\n", + "from azureml.exceptions import ComputeTargetException\n", + "from azureml.core import Workspace, Experiment\n", + "from azureml.pipeline.core import Pipeline, PipelineData\n", + "from azureml.pipeline.steps import DatabricksStep\n", + "from azureml.core.datastore import Datastore\n", + "from azureml.data.data_reference import DataReference\n", + "\n", + "# Check core SDK version number\n", + "print(\"SDK version:\", azureml.core.VERSION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialize Workspace\n", + "\n", + "Initialize a workspace object from persisted configuration. 
Make sure the config file is present at .\\config.json" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ws = Workspace.from_config()\n", + "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Attach Databricks compute target\n", + "Next, you need to add your Databricks workspace to Azure Machine Learning as a compute target and give it a name. You will use this name to refer to your Databricks workspace compute target inside Azure Machine Learning.\n", + "\n", + "- **Resource Group** - The resource group name of your Azure Machine Learning workspace\n", + "- **Databricks Workspace Name** - The workspace name of your Azure Databricks workspace\n", + "- **Databricks Access Token** - The access token you created in ADB\n", + "\n", + "**The Databricks workspace needs to be present in the same subscription as your AML workspace**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace with your account info before running.\n", + " \n", + "db_compute_name=os.getenv(\"DATABRICKS_COMPUTE_NAME\", \"\") # Databricks compute name\n", + "db_resource_group=os.getenv(\"DATABRICKS_RESOURCE_GROUP\", \"\") # Databricks resource group\n", + "db_workspace_name=os.getenv(\"DATABRICKS_WORKSPACE_NAME\", \"\") # Databricks workspace name\n", + "db_access_token=os.getenv(\"DATABRICKS_ACCESS_TOKEN\", \"\") # Databricks access token\n", + " \n", + "try:\n", + " databricks_compute = DatabricksCompute(workspace=ws, name=db_compute_name)\n", + " print('Compute target {} already exists'.format(db_compute_name))\n", + "except ComputeTargetException:\n", + " print('Compute not found, will use below parameters to attach new one')\n", + " print('db_compute_name {}'.format(db_compute_name))\n", + " print('db_resource_group {}'.format(db_resource_group))\n", + " print('db_workspace_name {}'.format(db_workspace_name))\n", + " print('db_access_token {}'.format(db_access_token))\n", + " \n", + " config = DatabricksCompute.attach_configuration(\n", + " resource_group = db_resource_group,\n", + " workspace_name = db_workspace_name,\n", + " access_token= db_access_token)\n", + " databricks_compute=ComputeTarget.attach(ws, db_compute_name, config)\n", + " databricks_compute.wait_for_completion(True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Connections with Inputs and Outputs\n", + "The DatabricksStep supports Azure Blob and ADLS for inputs and outputs. You will also need to define a [Secrets](https://docs.azuredatabricks.net/user-guide/secrets/index.html) scope to enable authentication to external data sources such as Blob and ADLS from Databricks.\n", + "\n", + "- Databricks documentation on [Azure Blob](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html)\n", + "- Databricks documentation on [ADLS](https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html)\n",
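+ "\n", + "As a sketch, one way to define such a scope up front is with the Databricks CLI (this assumes the CLI is installed and configured against your ADB workspace, and that the scope does not already exist; the key name mystoragekey is illustrative):\n", + "\n", + "```\n", + "databricks secrets create-scope --scope amlscope\n", + "databricks secrets put --scope amlscope --key mystoragekey\n", + "```\n",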
+ "\n", + "### Type of Data Access\n", + "Databricks allows you to interact with Azure Blob and ADLS in two ways.\n", + "- **Direct Access**: Databricks allows you to interact with Azure Blob or ADLS URIs directly. The input or output URIs will be mapped to a Databricks widget param in the Databricks notebook.\n", + "- **Mounting**: You will be supplied with additional parameters and secrets that will enable you to mount your ADLS or Azure Blob input or output location in your Databricks notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Direct Access: Python sample code\n", + "If you have a data reference named \"input\", it will represent the URI of the input and you can access it directly in the Databricks python notebook like so:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```python\n", + "dbutils.widgets.get(\"input\")\n", + "y = getArgument(\"input\")\n", + "df = spark.read.csv(y)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Mounting: Python sample code for Azure Blob\n", + "Given an Azure Blob data reference named \"input\", the following widget params will be made available in the Databricks notebook:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```python\n", + "# This contains the input URI\n", + "dbutils.widgets.get(\"input\")\n", + "myinput_uri = getArgument(\"input\")\n", + "\n", + "# How to get the input datastore name inside ADB notebook\n", + "# This contains the name of a Databricks secret (in the predefined \"amlscope\" secret scope) \n", + "# that contains an access key or sas for the Azure Blob input (this name is obtained by appending \n", + "# the name of the input with \"_blob_secretname\"). \n", + "dbutils.widgets.get(\"input_blob_secretname\") \n", + "myinput_blob_secretname = getArgument(\"input_blob_secretname\")\n", + "\n", + "# This contains the required configuration for mounting\n", + "dbutils.widgets.get(\"input_blob_config\")\n", + "myinput_blob_config = getArgument(\"input_blob_config\")\n", + "\n", + "# Usage\n", + "dbutils.fs.mount(\n", + " source = myinput_uri,\n", + " mount_point = \"/mnt/input\",\n", + " extra_configs = {myinput_blob_config:dbutils.secrets.get(scope = \"amlscope\", key = myinput_blob_secretname)})\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Mounting: Python sample code for ADLS\n", + "Given an ADLS data reference named \"input\", the following widget params will be made available in the Databricks notebook:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```python\n", + "# This contains the input URI\n", + "dbutils.widgets.get(\"input\") \n", + "myinput_uri = getArgument(\"input\")\n", + "\n", + "# This contains the client id for the service principal \n", + "# that has access to the adls input\n", + "dbutils.widgets.get(\"input_adls_clientid\") \n", + "myinput_adls_clientid = getArgument(\"input_adls_clientid\")\n", + "\n", + "# This contains the name of a Databricks secret (in the predefined \"amlscope\" secret scope) \n", + "# that contains the secret for the above mentioned service principal\n", + "dbutils.widgets.get(\"input_adls_secretname\") \n", + "myinput_adls_secretname = getArgument(\"input_adls_secretname\")\n", + "\n", + "# This contains the refresh url for the mounting configs\n", + "dbutils.widgets.get(\"input_adls_refresh_url\") \n", + "myinput_adls_refresh_url = getArgument(\"input_adls_refresh_url\")\n", + "\n", + "# Usage \n", + "configs = {\"dfs.adls.oauth2.access.token.provider.type\": \"ClientCredential\",\n", + " \"dfs.adls.oauth2.client.id\": myinput_adls_clientid,\n", + " \"dfs.adls.oauth2.credential\": 
dbutils.secrets.get(scope = \"amlscope\", key = myinput_adls_secretname),\n", + " \"dfs.adls.oauth2.refresh.url\": myinput_adls_refresh_url}\n", + "\n", + "dbutils.fs.mount(\n", + " source = myinput_uri,\n", + " mount_point = \"/mnt/output\",\n", + " extra_configs = configs)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use Databricks from Azure Machine Learning Pipeline\n", + "To use Databricks as a compute target from Azure Machine Learning Pipeline, a DatabricksStep is used. Let's define a datasource (via DataReference) and intermediate data (via PipelineData) to be used in DatabricksStep." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use the default blob storage\n", + "def_blob_store = Datastore(ws, \"workspaceblobstore\")\n", + "print('Datastore {} will be used'.format(def_blob_store.name))\n", + "\n", + "# We are uploading a sample file in the local directory to be used as a datasource\n", + "def_blob_store.upload_files(files=[\"./testdata.txt\"], target_path=\"dbtest\", overwrite=False)\n", + "\n", + "step_1_input = DataReference(datastore=def_blob_store, path_on_datastore=\"dbtest\",\n", + " data_reference_name=\"input\")\n", + "\n", + "step_1_output = PipelineData(\"output\", datastore=def_blob_store)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Add a DatabricksStep\n", + "Adds a Databricks notebook as a step in a Pipeline.\n", + "- ***name:** Name of the Module\n", + "- **inputs:** List of input connections for data consumed by this step. Fetch this inside the notebook using dbutils.widgets.get(\"input\")\n", + "- **outputs:** List of output port definitions for outputs produced by this step. Fetch this inside the notebook using dbutils.widgets.get(\"output\")\n", + "- **existing_cluster_id:** Cluster ID of an existing Interactive cluster on the Databricks workspace. If you are providing this, do not provide any of the parameters below that are used to create a new cluster such as spark_version, node_type, etc.\n", + "- **spark_version:** Version of spark for the databricks run cluster. default value: 4.0.x-scala2.11\n", + "- **node_type:** Azure vm node types for the databricks run cluster. default value: Standard_D3_v2\n", + "- **num_workers:** Specifies a static number of workers for the databricks run cluster\n", + "- **min_workers:** Specifies a min number of workers to use for auto-scaling the databricks run cluster\n", + "- **max_workers:** Specifies a max number of workers to use for auto-scaling the databricks run cluster\n", + "- **spark_env_variables:** Spark environment variables for the databricks run cluster (dictionary of {str:str}). default value: {'PYSPARK_PYTHON': '/databricks/python3/bin/python3'}\n", + "- **notebook_path:** Path to the notebook in the databricks instance. If you are providing this, do not provide python script related parameters or JAR related parameters.\n", + "- **notebook_params:** Parameters for the databricks notebook (dictionary of {str:str}). Fetch this inside the notebook using dbutils.widgets.get(\"myparam\")\n", + "- **python_script_path:** The path to the python script in the DBFS or S3. If you are providing this, do not provide python_script_name which is used for uploading script from local machine.\n", + "- **python_script_params:** Parameters for the python script (list of str)\n", + "- **main_class_name:** The name of the entry point in a JAR module. 
If you are providing this, do not provide any python script or notebook related parameters.\n", + "- **jar_params:** Parameters for the JAR module (list of str)\n", + "- **python_script_name:** name of a python script on your local machine (relative to source_directory). If you are providing this, do not provide python_script_path which is used to execute a remote python script; or any of the JAR or notebook related parameters.\n", + "- **source_directory:** folder that contains the script and other files\n", + "- **hash_paths:** list of paths to hash to detect a change in source_directory (script file is always hashed)\n", + "- **run_name:** Name in databricks for this run\n", + "- **timeout_seconds:** Timeout for the databricks run\n", + "- **runconfig:** Runconfig to use. Either pass runconfig or each library type as a separate parameter but do not mix the two\n", + "- **maven_libraries:** maven libraries for the databricks run\n", + "- **pypi_libraries:** pypi libraries for the databricks run\n", + "- **egg_libraries:** egg libraries for the databricks run\n", + "- **jar_libraries:** jar libraries for the databricks run\n", + "- **rcran_libraries:** rcran libraries for the databricks run\n", + "- **compute_target:** Azure Databricks compute\n", + "- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs\n", + "- **version:** Optional version tag to denote a change in functionality for the step\n", + "\n", + "\\* *denotes required fields* \n", + "*You must provide exactly one of num_workers or min_workers and max_workers parameters* \n", + "*You must provide exactly one of databricks_compute or databricks_compute_name parameters*\n", + "\n", + "## Use runconfig to specify library dependencies\n", + "You can use a runconfig to specify the library dependencies for your cluster in Databricks. The runconfig will contain a databricks section as follows:\n", + "\n", + "```yaml\n", + "environment:\n", + "# Databricks details\n", + " databricks:\n", + "# List of maven libraries.\n", + " mavenLibraries:\n", + " - coordinates: org.jsoup:jsoup:1.7.1\n", + " repo: ''\n", + " exclusions:\n", + " - slf4j:slf4j\n", + " - '*:hadoop-client'\n", + "# List of PyPi libraries\n", + " pypiLibraries:\n", + " - package: beautifulsoup4\n", + " repo: ''\n", + "# List of RCran libraries\n", + " rcranLibraries:\n", + " -\n", + "# Coordinates.\n", + " package: ada\n", + "# Repo\n", + " repo: http://cran.us.r-project.org\n", + "# List of JAR libraries\n", + " jarLibraries:\n", + " -\n", + "# Coordinates.\n", + " library: dbfs:/mnt/libraries/library.jar\n", + "# List of Egg libraries\n", + " eggLibraries:\n", + " -\n", + "# Coordinates.\n", + " library: dbfs:/mnt/libraries/library.egg\n", + "```\n", + "\n", + "You can then create a RunConfiguration object using this file and pass it as the runconfig parameter to DatabricksStep, as sketched in the next cell.\n", + "```python\n", + "from azureml.core.runconfig import RunConfiguration\n", + "\n", + "runconfig = RunConfiguration()\n", + "runconfig.load(path='', name='')\n", + "```" + ] + },
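+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a minimal sketch of wiring this together (assuming the YAML above has been saved as a runconfig file that RunConfiguration.load can locate; the name databricks_libraries and path '.' are illustrative, and step_1_input, notebook_path and databricks_compute are defined as in the neighboring cells):\n", + "\n", + "```python\n", + "from azureml.core.runconfig import RunConfiguration\n", + "from azureml.pipeline.steps import DatabricksStep\n", + "\n", + "# Load the saved run configuration that carries the databricks library section\n", + "runconfig = RunConfiguration.load(path='.', name='databricks_libraries')\n", + "\n", + "# Pass it via the runconfig parameter instead of the per-library-type parameters;\n", + "# min_workers/max_workers show the auto-scaling alternative to num_workers\n", + "dbNbStepWithLibs = DatabricksStep(\n", + "    name=\"DBNotebookWithLibraries\",\n", + "    inputs=[step_1_input],\n", + "    min_workers=1,\n", + "    max_workers=2,\n", + "    notebook_path=notebook_path,\n", + "    run_name='DB_Notebook_libs_demo',\n", + "    runconfig=runconfig,\n", + "    compute_target=databricks_compute)\n", + "```" + ] + },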
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1. Running the demo notebook already added to the Databricks workspace\n", + "Create a notebook in the Azure Databricks workspace, and provide the path to that notebook as the value associated with the environment variable \"DATABRICKS_NOTEBOOK_PATH\". This will then set the variable notebook_path when you run the code cell below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_path=os.getenv(\"DATABRICKS_NOTEBOOK_PATH\", \"\") # Databricks notebook path\n", + "\n", + "dbNbStep = DatabricksStep(\n", + " name=\"DBNotebookInWS\",\n", + " inputs=[step_1_input],\n", + " outputs=[step_1_output],\n", + " num_workers=1,\n", + " notebook_path=notebook_path,\n", + " notebook_params={'myparam': 'testparam'},\n", + " run_name='DB_Notebook_demo',\n", + " compute_target=databricks_compute,\n", + " allow_reuse=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Build and submit the Experiment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "steps = [dbNbStep]\n", + "pipeline = Pipeline(workspace=ws, steps=steps)\n", + "pipeline_run = Experiment(ws, 'DB_Notebook_demo').submit(pipeline)\n", + "pipeline_run.wait_for_completion()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### View Run Details" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.widgets import RunDetails\n", + "RunDetails(pipeline_run).show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Running a Python script from DBFS\n", + "This shows how to run a Python script in DBFS. \n", + "\n", + "To complete this, you will need to first upload the Python script from your local machine to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html). The CLI command is given below:\n", + "\n", + "```\n", + "dbfs cp ./train-db-dbfs.py dbfs:/train-db-dbfs.py\n", + "```\n", + "\n", + "The code in the cell below assumes that you have completed the previous step of uploading the script `train-db-dbfs.py` to the root folder in DBFS." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "python_script_path = os.getenv(\"DATABRICKS_PYTHON_SCRIPT_PATH\", \"\") # Databricks python script path\n", + "\n", + "dbPythonInDbfsStep = DatabricksStep(\n", + " name=\"DBPythonInDBFS\",\n", + " inputs=[step_1_input],\n", + " num_workers=1,\n", + " python_script_path=python_script_path,\n", + " python_script_params={'--input_data'},\n", + " run_name='DB_Python_demo',\n", + " compute_target=databricks_compute,\n", + " allow_reuse=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Build and submit the Experiment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "steps = [dbPythonInDbfsStep]\n", + "pipeline = Pipeline(workspace=ws, steps=steps)\n", + "pipeline_run = Experiment(ws, 'DB_Python_demo').submit(pipeline)\n", + "pipeline_run.wait_for_completion()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### View Run Details" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.widgets import RunDetails\n", + "RunDetails(pipeline_run).show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. 
Running a Python script in Databricks that is currently on your local computer\n", + "To run a Python script that is currently on your local computer, follow the instructions below. \n", + "\n", + "The code in the cell below assumes that you have `train-db-local.py` in the current working directory.\n", + "\n", + "In this case, the Python script will be uploaded first to DBFS, and then the script will be run in Databricks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "python_script_name = \"train-db-local.py\"\n", + "source_directory = \".\"\n", + "\n", + "dbPythonInLocalMachineStep = DatabricksStep(\n", + " name=\"DBPythonInLocalMachine\",\n", + " inputs=[step_1_input],\n", + " num_workers=1,\n", + " python_script_name=python_script_name,\n", + " source_directory=source_directory,\n", + " run_name='DB_Python_Local_demo',\n", + " compute_target=databricks_compute,\n", + " allow_reuse=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Build and submit the Experiment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "steps = [dbPythonInLocalMachineStep]\n", + "pipeline = Pipeline(workspace=ws, steps=steps)\n", + "pipeline_run = Experiment(ws, 'DB_Python_Local_demo').submit(pipeline)\n", + "pipeline_run.wait_for_completion()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### View Run Details" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.widgets import RunDetails\n", + "RunDetails(pipeline_run).show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4. Running a JAR job that is already added to DBFS\n", + "To run a JAR job that is already uploaded to DBFS, follow the instructions below. You will first upload the JAR file to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html).\n", + "\n", + "The code in the cell below assumes that you have uploaded `train-db-dbfs.jar` to the root folder in DBFS. 
You can upload `train-db-dbfs.jar` to the root folder in DBFS using the following command line, so that you can use `jar_library_dbfs_path = \"dbfs:/train-db-dbfs.jar\"`:\n", + "\n", + "```\n", + "dbfs cp ./train-db-dbfs.jar dbfs:/train-db-dbfs.jar\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "main_jar_class_name = \"com.microsoft.aeva.Main\"\n", + "jar_library_dbfs_path = os.getenv(\"DATABRICKS_JAR_LIB_PATH\", \"\") # Databricks jar library path\n", + "\n", + "dbJarInDbfsStep = DatabricksStep(\n", + " name=\"DBJarInDBFS\",\n", + " inputs=[step_1_input],\n", + " num_workers=1,\n", + " main_class_name=main_jar_class_name,\n", + " jar_params={'arg1', 'arg2'},\n", + " run_name='DB_JAR_demo',\n", + " jar_libraries=[JarLibrary(jar_library_dbfs_path)],\n", + " compute_target=databricks_compute,\n", + " allow_reuse=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Build and submit the Experiment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "steps = [dbJarInDbfsStep]\n", + "pipeline = Pipeline(workspace=ws, steps=steps)\n", + "pipeline_run = Experiment(ws, 'DB_JAR_demo').submit(pipeline)\n", + "pipeline_run.wait_for_completion()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### View Run Details" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.widgets import RunDetails\n", + "RunDetails(pipeline_run).show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Next: ADLA as a Compute Target\n", + "To use ADLA as a compute target from Azure Machine Learning Pipeline, an AdlaStep is used. This [notebook](./aml-pipelines-use-adla-as-compute-target.ipynb) demonstrates the use of AdlaStep in Azure Machine Learning Pipeline."
+ ] + } + ], + "metadata": { + "authors": [ + { + "name": "diray" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/azure-databricks/testdata.txt b/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/testdata.txt similarity index 100% rename from how-to-use-azureml/azure-databricks/testdata.txt rename to how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/testdata.txt diff --git a/how-to-use-azureml/azure-databricks/train-db-dbfs.py b/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/train-db-dbfs.py similarity index 100% rename from how-to-use-azureml/azure-databricks/train-db-dbfs.py rename to how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/train-db-dbfs.py diff --git a/how-to-use-azureml/azure-databricks/train-db-local.py b/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/train-db-local.py similarity index 100% rename from how-to-use-azureml/azure-databricks/train-db-local.py rename to how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/train-db-local.py diff --git a/how-to-use-azureml/deployment/enable-app-insights-in-production-service/enable-app-insights-in-production-service.ipynb b/how-to-use-azureml/deployment/enable-app-insights-in-production-service/enable-app-insights-in-production-service.ipynb index 48108ff67..9f2930015 100644 --- a/how-to-use-azureml/deployment/enable-app-insights-in-production-service/enable-app-insights-in-production-service.ipynb +++ b/how-to-use-azureml/deployment/enable-app-insights-in-production-service/enable-app-insights-in-production-service.ipynb @@ -263,7 +263,7 @@ " prediction = aci_service.run(input_data=test_sample)\n", " print(prediction)\n", "else:\n", - " raise ValueError(\"Service deployment isn't healthy, can't call the service\")" + " raise ValueError(\"Service deployment isn't healthy, can't call the service. Error: \", aci_service.error)" ] }, { @@ -381,7 +381,7 @@ " aks_service.wait_for_deployment(show_output = True)\n", " print(aks_service.state)\n", "else:\n", - " raise ValueError(\"AKS provisioning failed.\")" + " raise ValueError(\"AKS provisioning failed. Error: \", aks_service.error)" ] }, { @@ -409,7 +409,7 @@ " prediction = aks_service.run(input_data=test_sample)\n", " print(prediction)\n", "else:\n", - " raise ValueError(\"Service deployment isn't healthy, can't call the service\")" + " raise ValueError(\"Service deployment isn't healthy, can't call the service. 
Error: \", aks_service.error)" ] }, { @@ -465,7 +465,7 @@ "metadata": { "authors": [ { - "name": "jocier" + "name": "shipatel" } ], "kernelspec": { diff --git a/how-to-use-azureml/deployment/enable-data-collection-for-models-in-aks/enable-data-collection-for-models-in-aks.ipynb b/how-to-use-azureml/deployment/enable-data-collection-for-models-in-aks/enable-data-collection-for-models-in-aks.ipynb index cf9591bd5..e12e2ab66 100644 --- a/how-to-use-azureml/deployment/enable-data-collection-for-models-in-aks/enable-data-collection-for-models-in-aks.ipynb +++ b/how-to-use-azureml/deployment/enable-data-collection-for-models-in-aks/enable-data-collection-for-models-in-aks.ipynb @@ -329,7 +329,7 @@ " aks_service.wait_for_deployment(show_output = True)\n", " print(aks_service.state)\n", "else: \n", - " raise ValueError(\"aks provisioning failed, can't deploy service\")" + " raise ValueError(\"aks provisioning failed, can't deploy service. Error: \", aks_service.error)" ] }, { @@ -362,7 +362,7 @@ " prediction = aks_service.run(input_data=test_sample)\n", " print(prediction)\n", "else:\n", - " raise ValueError(\"Service deployment isn't healthy, can't call the service\")" + " raise ValueError(\"Service deployment isn't healthy, can't call the service. Error: \", aks_service.error)" ] }, { @@ -445,7 +445,7 @@ "metadata": { "authors": [ { - "name": "jocier" + "name": "shipatel" } ], "kernelspec": { diff --git a/how-to-use-azureml/deployment/onnx/Dockerfile b/how-to-use-azureml/deployment/onnx/Dockerfile new file mode 100644 index 000000000..2db89a25d --- /dev/null +++ b/how-to-use-azureml/deployment/onnx/Dockerfile @@ -0,0 +1,2 @@ +RUN apt-get update +RUN apt-get install -y libgomp1 diff --git a/how-to-use-azureml/deployment/onnx/onnx-convert-aml-deploy-tinyyolo.ipynb b/how-to-use-azureml/deployment/onnx/onnx-convert-aml-deploy-tinyyolo.ipynb index 6cc710683..fe3b7001e 100644 --- a/how-to-use-azureml/deployment/onnx/onnx-convert-aml-deploy-tinyyolo.ipynb +++ b/how-to-use-azureml/deployment/onnx/onnx-convert-aml-deploy-tinyyolo.ipynb @@ -272,6 +272,7 @@ "image_config = ContainerImage.image_configuration(execution_script = \"score.py\",\n", " runtime = \"python\",\n", " conda_file = \"myenv.yml\",\n", + " docker_file = \"Dockerfile\",\n", " description = \"TinyYOLO ONNX Demo\",\n", " tags = {\"demo\": \"onnx\"}\n", " )\n", diff --git a/how-to-use-azureml/deployment/onnx/onnx-inference-facial-expression-recognition-deploy.ipynb b/how-to-use-azureml/deployment/onnx/onnx-inference-facial-expression-recognition-deploy.ipynb index fc6377b0c..b94ea70b9 100644 --- a/how-to-use-azureml/deployment/onnx/onnx-inference-facial-expression-recognition-deploy.ipynb +++ b/how-to-use-azureml/deployment/onnx/onnx-inference-facial-expression-recognition-deploy.ipynb @@ -350,6 +350,7 @@ "image_config = ContainerImage.image_configuration(execution_script = \"score.py\",\n", " runtime = \"python\",\n", " conda_file = \"myenv.yml\",\n", + " docker_file = \"Dockerfile\",\n", " description = \"Emotion ONNX Runtime container\",\n", " tags = {\"demo\": \"onnx\"})\n", "\n", @@ -772,7 +773,7 @@ "- ensured that your deep learning model is working perfectly (in the cloud) on test data, and checked it against some of your own!\n", "\n", "Next steps:\n", - "- If you have not already, check out another interesting ONNX/AML application that lets you set up a state-of-the-art [handwritten image classification model (MNIST)](https://github.com/Azure/MachineLearningNotebooks/tree/master/onnx/onnx-inference-mnist.ipynb) in the cloud! 
This tutorial deploys a pre-trained ONNX Computer Vision model for handwritten digit classification in an Azure ML virtual machine.\n", + "- If you have not already, check out another interesting ONNX/AML application that lets you set up a state-of-the-art [handwritten image classification model (MNIST)](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-inference-mnist-deploy.ipynb) in the cloud! This tutorial deploys a pre-trained ONNX Computer Vision model for handwritten digit classification in an Azure ML virtual machine.\n", "- Keep an eye out for an updated version of this tutorial that uses ONNX Runtime GPU.\n", "- Contribute to our [open source ONNX repository on github](http://github.com/onnx/onnx) and/or add to our [ONNX model zoo](http://github.com/onnx/models)" ] diff --git a/how-to-use-azureml/deployment/onnx/onnx-inference-mnist-deploy.ipynb b/how-to-use-azureml/deployment/onnx/onnx-inference-mnist-deploy.ipynb index 6bd83c2fa..fa12f6ac3 100644 --- a/how-to-use-azureml/deployment/onnx/onnx-inference-mnist-deploy.ipynb +++ b/how-to-use-azureml/deployment/onnx/onnx-inference-mnist-deploy.ipynb @@ -333,6 +333,7 @@ "image_config = ContainerImage.image_configuration(execution_script = \"score.py\",\n", " runtime = \"python\",\n", " conda_file = \"myenv.yml\",\n", + " docker_file = \"Dockerfile\",\n", " description = \"MNIST ONNX Runtime container\",\n", " tags = {\"demo\": \"onnx\"}) \n", "\n", @@ -777,7 +778,7 @@ "- ensured that your deep learning model is working perfectly (in the cloud) on test data, and checked it against some of your own!\n", "\n", "Next steps:\n", - "- Check out another interesting application based on a Microsoft Research computer vision paper that lets you set up a [facial emotion recognition model](https://github.com/Azure/MachineLearningNotebooks/tree/master/onnx/onnx-inference-emotion-recognition.ipynb) in the cloud! This tutorial deploys a pre-trained ONNX Computer Vision model in an Azure ML virtual machine.\n", + "- Check out another interesting application based on a Microsoft Research computer vision paper that lets you set up a [facial emotion recognition model](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-inference-facial-expression-recognition-deploy.ipynb) in the cloud! 
This tutorial deploys a pre-trained ONNX Computer Vision model in an Azure ML virtual machine.\n", "- Contribute to our [open source ONNX repository on github](http://github.com/onnx/onnx) and/or add to our [ONNX model zoo](http://github.com/onnx/models)" ] } diff --git a/how-to-use-azureml/deployment/onnx/onnx-modelzoo-aml-deploy-resnet50.ipynb b/how-to-use-azureml/deployment/onnx/onnx-modelzoo-aml-deploy-resnet50.ipynb index 099948272..781d25c28 100644 --- a/how-to-use-azureml/deployment/onnx/onnx-modelzoo-aml-deploy-resnet50.ipynb +++ b/how-to-use-azureml/deployment/onnx/onnx-modelzoo-aml-deploy-resnet50.ipynb @@ -256,6 +256,7 @@ "image_config = ContainerImage.image_configuration(execution_script = \"score.py\",\n", " runtime = \"python\",\n", " conda_file = \"myenv.yml\",\n", + " docker_file = \"Dockerfile\",\n", " description = \"ONNX ResNet50 Demo\",\n", " tags = {\"demo\": \"onnx\"}\n", " )\n", diff --git a/how-to-use-azureml/explain-model/README.md b/how-to-use-azureml/explain-model/README.md new file mode 100644 index 000000000..1d9621362 --- /dev/null +++ b/how-to-use-azureml/explain-model/README.md @@ -0,0 +1,11 @@ +## Using explain model APIs + +Follow these sample notebooks to learn: + +1. [Explain tabular data](explain-tabular-data): Basic example of explaining model trained on tabular data. +2. [Explain local classification](explain-local-sklearn-classification): Explain a scikit-learn classification model. +3. [Explain local regression](explain-local-sklearn-regression): Explain a scikit-learn regression model. +4. [Explain on remote AMLCompute](explain-on-amlcompute): Explain a model on a remote AMLCompute target. +5. [Explain classification using Run History](explain-run-history-sklearn-classification): Explain a scikit-learn classification model with Run History. +6. [Explain regression using Run History](explain-run-history-sklearn-regression): Explain a scikit-learn regression model with Run History. +7. [Explain scikit-learn raw features](explain-sklearn-raw-features): Explain the raw features of a trained scikit-learn model. diff --git a/how-to-use-azureml/explain-model/explain-local-sklearn-classification/explain-local-sklearn-classification.ipynb b/how-to-use-azureml/explain-model/explain-local-sklearn-classification/explain-local-sklearn-classification.ipynb new file mode 100644 index 000000000..441856729 --- /dev/null +++ b/how-to-use-azureml/explain-model/explain-local-sklearn-classification/explain-local-sklearn-classification.ipynb @@ -0,0 +1,243 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Breast cancer diagnosis classification with scikit-learn (run model explainer locally)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Explain a model with the AML explain-model package\n", + "\n", + "1. Train a SVM classification model using Scikit-learn\n", + "2. Run 'explain_model' with full data in local mode, which doesn't contact any Azure services\n", + "3. 
Run 'explain_model' with summarized data in local mode, which doesn't contact any Azure services" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import load_breast_cancer\n", + "from sklearn import svm\n", + "from azureml.explain.model.tabular_explainer import TabularExplainer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 1. Run model explainer locally with full data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the breast cancer diagnosis data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "breast_cancer_data = load_breast_cancer()\n", + "classes = breast_cancer_data.target_names.tolist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data into train and test\n", + "from sklearn.model_selection import train_test_split\n", + "x_train, x_test, y_train, y_test = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size=0.2, random_state=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train a SVM classification model, which you want to explain" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "clf = svm.SVC(gamma=0.001, C=100., probability=True)\n", + "model = clf.fit(x_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain predictions on your local machine" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tabular_explainer = TabularExplainer(model, x_train, features=breast_cancer_data.feature_names, classes=classes)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain overall model predictions (global explanation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data\n", + "# x_train can be passed as well, but with more examples explanations will take longer although they may be more accurate\n", + "global_explanation = tabular_explainer.explain_global(x_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sorted SHAP values\n", + "print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))\n", + "# Corresponding feature names\n", + "print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))\n", + "# feature ranks (based on original order of features)\n", + "print('global importance rank: {}'.format(global_explanation.global_importance_rank))\n", + "# per class feature names\n", + "print('ranked per class feature names: {}'.format(global_explanation.get_ranked_per_class_names()))\n", + "# per class feature importance values\n", + "print('ranked per class feature values: {}'.format(global_explanation.get_ranked_per_class_values()))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dict(zip(global_explanation.get_ranked_global_names(), global_explanation.get_ranked_global_values()))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "## Explain overall model predictions as a collection of local (instance-level) explanations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# feature shap values for all features and all data points in the training data\n", + "print('local importance values: {}'.format(global_explanation.local_importance_values))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain local data points (individual instances)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_explanation = tabular_explainer.explain_local(x_test[0,:])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# local feature importance information\n", + "local_importance_values = local_explanation.local_importance_values\n", + "print('local importance for first instance: {}'.format(local_importance_values[y_test[0]]))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print('local importance feature names: {}'.format(list(local_explanation.features)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dict(zip(local_explanation.features, local_explanation.local_importance_values[y_test[0]]))" + ] + } + ], + "metadata": { + "authors": [ + { + "name": "wamartin" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/explain-model/explain-local-sklearn-regression/explain-local-sklearn-regression.ipynb b/how-to-use-azureml/explain-model/explain-local-sklearn-regression/explain-local-sklearn-regression.ipynb new file mode 100644 index 000000000..afcd5a17c --- /dev/null +++ b/how-to-use-azureml/explain-model/explain-local-sklearn-regression/explain-local-sklearn-regression.ipynb @@ -0,0 +1,231 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Boston Housing Price Prediction with scikit-learn (run model explainer locally)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Explain a model with the AML explain-model package\n", + "\n", + "1. Train a GradientBoosting regression model using Scikit-learn\n", + "2. Run 'explain_model' with full dataset in local mode, which doesn't contact any Azure services.\n", + "3. Run 'explain_model' with summarized dataset in local mode, which doesn't contact any Azure services." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn import datasets\n", + "from sklearn.ensemble import GradientBoostingRegressor\n", + "from azureml.explain.model.tabular_explainer import TabularExplainer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 1. 
Run model explainer locally with full data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the Boston house price data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "boston_data = datasets.load_boston()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data into train and test\n", + "from sklearn.model_selection import train_test_split\n", + "x_train, x_test, y_train, y_test = train_test_split(boston_data.data, boston_data.target, test_size=0.2, random_state=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train a GradientBoosting Regression model, which you want to explain" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "clf = GradientBoostingRegressor(n_estimators=100, max_depth=4,\n", + " learning_rate=0.1, loss='huber',\n", + " random_state=1)\n", + "model = clf.fit(x_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain predictions on your local machine" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tabular_explainer = TabularExplainer(model, x_train, features = boston_data.feature_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain overall model predictions (global explanation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data\n", + "# x_train can be passed as well, but with more examples explanations will take longer although they may be more accurate\n", + "global_explanation = tabular_explainer.explain_global(x_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(global_explanation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sorted SHAP values \n", + "print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))\n", + "# Corresponding feature names\n", + "print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))\n", + "# feature ranks (based on original order of features)\n", + "print('global importance rank: {}'.format(global_explanation.global_importance_rank))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dict(zip(global_explanation.get_ranked_global_names(), global_explanation.get_ranked_global_values()))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain overall model predictions as a collection of local (instance-level) explanations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# feature shap values for all features and all data points in the training data\n", + "print('local importance values: {}'.format(global_explanation.local_importance_values))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain local data points (individual instances)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + 
"source": [ + "local_explanation = tabular_explainer.explain_local(x_test[0,:])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# local feature importance information\n", + "local_importance_values = local_explanation.local_importance_values\n", + "print('local importance values: {}'.format(local_importance_values))" + ] + } + ], + "metadata": { + "authors": [ + { + "name": "wamartin" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/explain-model/explain-on-amlcompute/regression-sklearn-on-amlcompute.ipynb b/how-to-use-azureml/explain-model/explain-on-amlcompute/regression-sklearn-on-amlcompute.ipynb new file mode 100644 index 000000000..f054dd57f --- /dev/null +++ b/how-to-use-azureml/explain-model/explain-on-amlcompute/regression-sklearn-on-amlcompute.ipynb @@ -0,0 +1,602 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Train using Azure Machine Learning Compute\n", + "\n", + "* Initialize a Workspace\n", + "* Create an Experiment\n", + "* Introduction to AmlCompute\n", + "* Submit an AmlCompute run in a few different ways\n", + " - Provision as a run based compute target \n", + " - Provision as a persistent compute target (Basic)\n", + " - Provision as a persistent compute target (Advanced)\n", + "* Additional operations to perform on AmlCompute\n", + "* Download model explanation data from the Run History Portal\n", + "* Print the explanation data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "Make sure you go through the [configuration notebook](../../../configuration.ipynb) first if you haven't." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Check core SDK version number\n", + "import azureml.core\n", + "\n", + "print(\"SDK version:\", azureml.core.VERSION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialize a Workspace\n", + "\n", + "Initialize a workspace object from persisted configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "create workspace" + ] + }, + "outputs": [], + "source": [ + "from azureml.core import Workspace\n", + "\n", + "ws = Workspace.from_config()\n", + "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create An Experiment\n", + "\n", + "**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core import Experiment\n", + "experiment_name = 'explainer-remote-run-on-amlcompute'\n", + "experiment = Experiment(workspace=ws, name=experiment_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction to AmlCompute\n", + "\n", + "Azure Machine Learning Compute is managed compute infrastructure that allows the user to easily create single- to multi-node compute of the appropriate VM family. It is created **within your workspace region** and is a resource that can be used by other users in your workspace. It autoscales by default to max_nodes when a job is submitted, and executes in a containerized environment, packaging the dependencies as specified by the user. \n", + "\n", + "Since it is managed compute, job scheduling and cluster management are handled internally by the Azure Machine Learning service. \n", + "\n", + "For more information on Azure Machine Learning Compute, please read [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute).\n", + "\n", + "If you are an existing BatchAI customer who is migrating to Azure Machine Learning, please read [this article](https://aka.ms/batchai-retirement).\n", + "\n", + "**Note**: As with other Azure services, there are limits on certain resources (e.g. AmlCompute quota) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.\n", + "\n", + "\n", + "The training script `run_explainer.py` is already created for you. Let's have a look." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Submit an AmlCompute run in a few different ways\n", + "\n", + "First, let's check which VM families are available in your region. Azure is a regional service and some specialized SKUs (especially GPUs) are only available in certain regions. Since AmlCompute is created in the region of your workspace, we will use the supported_vmsizes() function to see if the VM family we want to use ('STANDARD_D2_V2') is supported.\n", + "\n", + "You can also pass a different region to check availability and then re-create your workspace in that region through the [configuration notebook](../../../configuration.ipynb)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core.compute import ComputeTarget, AmlCompute\n", + "\n", + "AmlCompute.supported_vmsizes(workspace=ws)\n", + "# AmlCompute.supported_vmsizes(workspace=ws, location='southcentralus')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create project directory\n", + "\n", + "Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource.
This includes the training script, and any additional files your training script depends on" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import shutil\n", + "\n", + "project_folder = './explainer-remote-run-on-amlcompute'\n", + "os.makedirs(project_folder, exist_ok=True)\n", + "shutil.copy('run_explainer.py', project_folder)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Provision as a run based compute target\n", + "\n", + "You can provision AmlCompute as a compute target at run-time. In this case, the compute is auto-created for your run, scales up to max_nodes that you specify, and then **deleted automatically** after the run completes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core.runconfig import RunConfiguration\n", + "from azureml.core.conda_dependencies import CondaDependencies\n", + "from azureml.core.runconfig import DEFAULT_CPU_IMAGE\n", + "\n", + "# create a new runconfig object\n", + "run_config = RunConfiguration()\n", + "\n", + "# signal that you want to use AmlCompute to execute script.\n", + "run_config.target = \"amlcompute\"\n", + "\n", + "# AmlCompute will be created in the same region as workspace\n", + "# Set vm size for AmlCompute\n", + "run_config.amlcompute.vm_size = 'STANDARD_D2_V2'\n", + "\n", + "# enable Docker \n", + "run_config.environment.docker.enabled = True\n", + "\n", + "# set Docker base image to the default CPU-based image\n", + "run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE\n", + "\n", + "# use conda_dependencies.yml to create a conda environment in the Docker image for execution\n", + "run_config.environment.python.user_managed_dependencies = False\n", + "\n", + "# auto-prepare the Docker image when used for execution (if it is not already prepared)\n", + "run_config.auto_prepare_environment = True\n", + "\n", + "azureml_pip_packages = [\n", + " 'azureml-defaults', 'azureml-contrib-explain-model', 'azureml-core', 'azureml-telemetry',\n", + " 'azureml-explain-model'\n", + "]\n", + "\n", + "# specify CondaDependencies obj\n", + "run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'],\n", + " pip_packages=azureml_pip_packages)\n", + "\n", + "# Now submit a run on AmlCompute\n", + "from azureml.core.script_run_config import ScriptRunConfig\n", + "\n", + "script_run_config = ScriptRunConfig(source_directory=project_folder,\n", + " script='run_explainer.py',\n", + " run_config=run_config)\n", + "\n", + "run = experiment.submit(script_run_config)\n", + "\n", + "# Show run details\n", + "run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "# Shows output of the run on stdout.\n", + "run.wait_for_completion(show_output=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Provision as a persistent compute target (Basic)\n", + "\n", + "You can provision a persistent AmlCompute resource by simply defining two parameters thanks to smart defaults. By default it autoscales from 0 nodes and provisions dedicated VMs to run your job in a container. 
This is useful when you want to continuously re-use the same target, debug it between jobs, or simply share the resource with other users of your workspace.\n", + "\n", + "* `vm_size`: VM family of the nodes provisioned by AmlCompute. Simply choose from the supported_vmsizes() above\n", + "* `max_nodes`: Maximum nodes to autoscale to while running a job on AmlCompute" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core.compute import ComputeTarget, AmlCompute\n", + "from azureml.core.compute_target import ComputeTargetException\n", + "\n", + "# Choose a name for your CPU cluster\n", + "cpu_cluster_name = \"cpucluster\"\n", + "\n", + "# Verify that cluster does not exist already\n", + "try:\n", + " cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n", + " print('Found existing cluster, use it.')\n", + "except ComputeTargetException:\n", + " compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',\n", + " max_nodes=4)\n", + " cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n", + "\n", + "cpu_cluster.wait_for_completion(show_output=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure & Run" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core.runconfig import RunConfiguration\n", + "from azureml.core.conda_dependencies import CondaDependencies\n", + "\n", + "# create a new RunConfig object\n", + "run_config = RunConfiguration(framework=\"python\")\n", + "\n", + "# Set compute target to AmlCompute target created in previous step\n", + "run_config.target = cpu_cluster.name\n", + "\n", + "# enable Docker \n", + "run_config.environment.docker.enabled = True\n", + "\n", + "azureml_pip_packages = [\n", + " 'azureml-defaults', 'azureml-contrib-explain-model', 'azureml-core', 'azureml-telemetry',\n", + " 'azureml-explain-model'\n", + "]\n", + "\n", + "# specify CondaDependencies obj\n", + "run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'],\n", + " pip_packages=azureml_pip_packages)\n", + "\n", + "from azureml.core import Run\n", + "from azureml.core import ScriptRunConfig\n", + "\n", + "src = ScriptRunConfig(source_directory=project_folder, \n", + " script='run_explainer.py', \n", + " run_config=run_config) \n", + "run = experiment.submit(config=src)\n", + "run" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "# Shows output of the run on stdout.\n", + "run.wait_for_completion(show_output=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run.get_metrics()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Provision as a persistent compute target (Advanced)\n", + "\n", + "You can also specify additional properties or change defaults while provisioning AmlCompute using a more advanced configuration.
This is useful when you want a dedicated cluster of 4 nodes (for example, you can set both min_nodes and max_nodes to 4), or want the compute to be within an existing VNet in your subscription.\n", + "\n", + "In addition to `vm_size` and `max_nodes`, you can specify:\n", + "* `min_nodes`: Minimum nodes (default 0 nodes) to downscale to while running a job on AmlCompute\n", + "* `vm_priority`: Choose between 'dedicated' (default) and 'lowpriority' VMs when provisioning AmlCompute. Low-priority VMs use Azure's excess capacity and are thus cheaper, but risk your run being pre-empted\n", + "* `idle_seconds_before_scaledown`: Idle time (default 120 seconds) to wait after run completion before auto-scaling to min_nodes\n", + "* `vnet_resourcegroup_name`: Resource group of the **existing** VNet within which AmlCompute should be provisioned\n", + "* `vnet_name`: Name of the VNet\n", + "* `subnet_name`: Name of the subnet within the VNet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core.compute import ComputeTarget, AmlCompute\n", + "from azureml.core.compute_target import ComputeTargetException\n", + "\n", + "# Choose a name for your CPU cluster\n", + "cpu_cluster_name = \"cpucluster\"\n", + "\n", + "# Verify that cluster does not exist already\n", + "try:\n", + " cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n", + " print('Found existing cluster, use it.')\n", + "except ComputeTargetException:\n", + " compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',\n", + " vm_priority='lowpriority',\n", + " min_nodes=2,\n", + " max_nodes=4,\n", + " idle_seconds_before_scaledown=300,\n", + " vnet_resourcegroup_name='',\n", + " vnet_name='',\n", + " subnet_name='')\n", + " cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n", + "\n", + "cpu_cluster.wait_for_completion(show_output=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure & Run" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.core.runconfig import RunConfiguration\n", + "from azureml.core.conda_dependencies import CondaDependencies\n", + "\n", + "# create a new RunConfig object\n", + "run_config = RunConfiguration(framework=\"python\")\n", + "\n", + "# Set compute target to AmlCompute target created in previous step\n", + "run_config.target = cpu_cluster.name\n", + "\n", + "# enable Docker \n", + "run_config.environment.docker.enabled = True\n", + "\n", + "azureml_pip_packages = [\n", + " 'azureml-defaults', 'azureml-contrib-explain-model', 'azureml-core', 'azureml-telemetry',\n", + " 'azureml-explain-model'\n", + "]\n", + "\n", + "# specify CondaDependencies obj\n", + "run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'],\n", + " pip_packages=azureml_pip_packages)\n", + "\n", + "from azureml.core import Run\n", + "from azureml.core import ScriptRunConfig\n", + "\n", + "src = ScriptRunConfig(source_directory=project_folder, \n", + " script='run_explainer.py', \n", + " run_config=run_config) \n", + "run = experiment.submit(config=src)\n", + "run" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "# Shows output of the run on stdout.\n", + "run.wait_for_completion(show_output=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata":
{}, + "outputs": [], + "source": [ + "run.get_metrics()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.contrib.explain.model.explanation.explanation_client import ExplanationClient\n", + "\n", + "client = ExplanationClient.from_run(run)\n", + "# Get the top k (e.g., 4) most important features with their importance values\n", + "explanation = client.download_model_explanation(top_k=4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Additional operations to perform on AmlCompute\n", + "\n", + "You can perform more operations on AmlCompute such as updating the node counts or deleting the compute. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get_status () gets the latest status of the AmlCompute target\n", + "cpu_cluster.get_status().serialize()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Update () takes in the min_nodes, max_nodes and idle_seconds_before_scaledown and updates the AmlCompute target\n", + "# cpu_cluster.update(min_nodes=1)\n", + "# cpu_cluster.update(max_nodes=10)\n", + "cpu_cluster.update(idle_seconds_before_scaledown=300)\n", + "# cpu_cluster.update(min_nodes=2, max_nodes=4, idle_seconds_before_scaledown=600)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Delete () is used to deprovision and delete the AmlCompute target. Useful if you want to re-use the compute name \n", + "# 'cpucluster' in this case but use a different VM family for instance.\n", + "\n", + "# cpu_cluster.delete()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download Model Explanation Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.contrib.explain.model.explanation.explanation_client import ExplanationClient\n", + "\n", + "# Get model explanation data\n", + "client = ExplanationClient.from_run(run)\n", + "explanation = client.download_model_explanation()\n", + "local_importance_values = explanation.local_importance_values\n", + "expected_values = explanation.expected_values\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Or you can use the saved run.id to retrive the feature importance values\n", + "client = ExplanationClient.from_run_id(ws, experiment_name, run.id)\n", + "explanation = client.download_model_explanation()\n", + "local_importance_values = explanation.local_importance_values\n", + "expected_values = explanation.expected_values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get the top k (e.g., 4) most important features with their importance values\n", + "explanation = client.download_model_explanation(top_k=4)\n", + "global_importance_values = explanation.get_ranked_global_values()\n", + "global_importance_names = explanation.get_ranked_global_names()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print('global importance values: {}'.format(global_importance_values))\n", + "print('global importance names: {}'.format(global_importance_names))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Success!\n", + 
"Great, you are ready to move on to the remaining notebooks." + ] + } + ], + "metadata": { + "authors": [ + { + "name": "wamartin" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/explain-model/explain-on-amlcompute/run_explainer.py b/how-to-use-azureml/explain-model/explain-on-amlcompute/run_explainer.py new file mode 100644 index 000000000..720720800 --- /dev/null +++ b/how-to-use-azureml/explain-model/explain-on-amlcompute/run_explainer.py @@ -0,0 +1,52 @@ +# Copyright (c) Microsoft. All rights reserved. +# Licensed under the MIT license. + +from sklearn import datasets +from sklearn.linear_model import Ridge +from azureml.explain.model.tabular_explainer import TabularExplainer +from azureml.contrib.explain.model.explanation.explanation_client import ExplanationClient +from sklearn.model_selection import train_test_split +from azureml.core.run import Run +from sklearn.externals import joblib +import os +import numpy as np + +os.makedirs('./outputs', exist_ok=True) + +boston_data = datasets.load_boston() + +run = Run.get_context() +client = ExplanationClient.from_run(run) + +X_train, X_test, y_train, y_test = train_test_split(boston_data.data, + boston_data.target, + test_size=0.2, + random_state=0) + +alpha = 0.5 +# Use Ridge algorithm to create a regression model +reg = Ridge(alpha) +model = reg.fit(X_train, y_train) + +preds = reg.predict(X_test) +run.log('alpha', alpha) + +model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha) +# save model in the outputs folder so it automatically get uploaded +with open(model_file_name, 'wb') as file: + joblib.dump(value=reg, filename=os.path.join('./outputs/', + model_file_name)) + +# Explain predictions on your local machine +tabular_explainer = TabularExplainer(model, X_train, features=boston_data.feature_names) + +# Explain overall model predictions (global explanation) +# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data +# x_train can be passed as well, but with more examples explanations it will +# take longer although they may be more accurate +global_explanation = tabular_explainer.explain_global(X_test) + +# Uploading model explanation data for storage or visualization in webUX +# The explanation can then be downloaded on any compute +comment = 'Global explanation on regression model trained on boston dataset' +client.upload_model_explanation(global_explanation, comment=comment) diff --git a/how-to-use-azureml/explain-model/explain-run-history-sklearn-classification/explain-run-history-sklearn-classification.ipynb b/how-to-use-azureml/explain-model/explain-run-history-sklearn-classification/explain-run-history-sklearn-classification.ipynb new file mode 100644 index 000000000..4c9c489a5 --- /dev/null +++ b/how-to-use-azureml/explain-model/explain-run-history-sklearn-classification/explain-run-history-sklearn-classification.ipynb @@ -0,0 +1,255 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Breast cancer diagnosis classification with scikit-learn (save model explanations via AML Run History)" + ] + }, + { + "cell_type": "markdown", 
+ "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Explain a model with the AML explain-model package\n", + "\n", + "1. Train a SVM classification model using Scikit-learn\n", + "2. Run 'explain_model' with AML Run History, which leverages run history service to store and manage the explanation data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import load_breast_cancer\n", + "from sklearn import svm\n", + "from azureml.explain.model.tabular_explainer import TabularExplainer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 1. Run model explainer locally with full data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the breast cancer diagnosis data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "breast_cancer_data = load_breast_cancer()\n", + "classes = breast_cancer_data.target_names.tolist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data into train and test\n", + "from sklearn.model_selection import train_test_split\n", + "x_train, x_test, y_train, y_test = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size=0.2, random_state=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train a SVM classification model, which you want to explain" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "clf = svm.SVC(gamma=0.001, C=100., probability=True)\n", + "model = clf.fit(x_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain predictions on your local machine" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tabular_explainer = TabularExplainer(model, x_train, features=breast_cancer_data.feature_names, classes=classes)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain overall model predictions (global explanation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data\n", + "# x_train can be passed as well, but with more examples explanations will take longer although they may be more accurate\n", + "global_explanation = tabular_explainer.explain_global(x_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 2. 
Save Model Explanation With AML Run History" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import azureml.core\n", + "from azureml.core import Workspace, Experiment, Run\n", + "from azureml.explain.model.tabular_explainer import TabularExplainer\n", + "from azureml.contrib.explain.model.explanation.explanation_client import ExplanationClient\n", + "# Check core SDK version number\n", + "print(\"SDK version:\", azureml.core.VERSION)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ws = Workspace.from_config()\n", + "print('Workspace name: ' + ws.name, \n", + " 'Azure region: ' + ws.location, \n", + " 'Subscription id: ' + ws.subscription_id, \n", + " 'Resource group: ' + ws.resource_group, sep = '\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "experiment_name = 'explain_model'\n", + "experiment = Experiment(ws, experiment_name)\n", + "run = experiment.start_logging()\n", + "client = ExplanationClient.from_run(run)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Uploading model explanation data for storage or visualization in webUX\n", + "# The explanation can then be downloaded on any compute\n", + "client.upload_model_explanation(global_explanation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get model explanation data\n", + "explanation = client.download_model_explanation()\n", + "local_importance_values = explanation.local_importance_values\n", + "expected_values = explanation.expected_values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get the top k (e.g., 4) most important features with their importance values\n", + "explanation = client.download_model_explanation(top_k=4)\n", + "global_importance_values = explanation.get_ranked_global_values()\n", + "global_importance_names = explanation.get_ranked_global_names()\n", + "per_class_names = explanation.get_ranked_per_class_names()[0]\n", + "per_class_values = explanation.get_ranked_per_class_values()[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print('per class feature importance values: {}'.format(per_class_values))\n", + "print('per class feature importance names: {}'.format(per_class_names))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dict(zip(per_class_names, per_class_values))" + ] + } + ], + "metadata": { + "authors": [ + { + "name": "wamartin" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/explain-model/explain-run-history-sklearn-regression/explain-run-history-sklearn-regression.ipynb b/how-to-use-azureml/explain-model/explain-run-history-sklearn-regression/explain-run-history-sklearn-regression.ipynb new file mode 100644 index 000000000..6b6edd25f --- 
/dev/null +++ b/how-to-use-azureml/explain-model/explain-run-history-sklearn-regression/explain-run-history-sklearn-regression.ipynb @@ -0,0 +1,269 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Boston Housing Price Prediction with scikit-learn (save model explanations via AML Run History)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Explain a model with the AML explain-model package\n", + "\n", + "1. Train a GradientBoosting regression model using Scikit-learn\n", + "2. Run 'explain_model' with AML Run History, which leverages the run history service to store and manage the explanation data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Save Model Explanation With AML Run History" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import the Boston housing dataset\n", + "from sklearn import datasets\n", + "from sklearn.ensemble import GradientBoostingRegressor\n", + "\n", + "import azureml.core\n", + "from azureml.core import Workspace, Experiment, Run\n", + "from azureml.explain.model.tabular_explainer import TabularExplainer\n", + "from azureml.contrib.explain.model.explanation.explanation_client import ExplanationClient\n", + "# Check core SDK version number\n", + "print(\"SDK version:\", azureml.core.VERSION)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ws = Workspace.from_config()\n", + "print('Workspace name: ' + ws.name, \n", + " 'Azure region: ' + ws.location, \n", + " 'Subscription id: ' + ws.subscription_id, \n", + " 'Resource group: ' + ws.resource_group, sep = '\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "experiment_name = 'explain_model'\n", + "experiment = Experiment(ws, experiment_name)\n", + "run = experiment.start_logging()\n", + "client = ExplanationClient.from_run(run)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the Boston house price data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "boston_data = datasets.load_boston()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data into train and test\n", + "from sklearn.model_selection import train_test_split\n", + "x_train, x_test, y_train, y_test = train_test_split(boston_data.data, boston_data.target, test_size=0.2, random_state=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train a GradientBoosting Regression model, which you want to explain" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "clf = GradientBoostingRegressor(n_estimators=100, max_depth=4,\n", + " learning_rate=0.1, loss='huber',\n", + " random_state=1)\n", + "model = clf.fit(x_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain predictions on your local machine" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tabular_explainer = TabularExplainer(model, x_train,
features=boston_data.feature_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain overall model predictions (global explanation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data\n", + "# x_train can be passed as well, but with more examples explanations will take longer although they may be more accurate\n", + "global_explanation = tabular_explainer.explain_global(x_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Uploading model explanation data for storage or visualization in webUX\n", + "# The explanation can then be downloaded on any compute\n", + "client.upload_model_explanation(global_explanation)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get model explanation data\n", + "explanation = client.download_model_explanation()\n", + "local_importance_values = explanation.local_importance_values\n", + "expected_values = explanation.expected_values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Print the values\n", + "print('expected values: {}'.format(expected_values))\n", + "print('local importance values: {}'.format(local_importance_values))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get the top k (e.g., 4) most important features with their importance values\n", + "explanation = client.download_model_explanation(top_k=4)\n", + "global_importance_values = explanation.get_ranked_global_values()\n", + "global_importance_names = explanation.get_ranked_global_names()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print('global importance values: {}'.format(global_importance_values))\n", + "print('global importance names: {}'.format(global_importance_names))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explain individual instance predictions (local explanation) ##### needs to get updated with the new build" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_explanation = tabular_explainer.explain_local(x_test[0,:])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# local feature importance information\n", + "local_importance_values = local_explanation.local_importance_values\n", + "print('local importance values: {}'.format(local_importance_values))" + ] + } + ], + "metadata": { + "authors": [ + { + "name": "wamartin" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/explain-model/explain-sklearn-raw-features/explain-sklearn-raw-features.ipynb 
b/how-to-use-azureml/explain-model/explain-sklearn-raw-features/explain-sklearn-raw-features.ipynb new file mode 100644 index 000000000..65c4021d5 --- /dev/null +++ b/how-to-use-azureml/explain-model/explain-sklearn-raw-features/explain-sklearn-raw-features.ipynb @@ -0,0 +1,221 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Summary\n", + "From raw data that is a mixture of categoricals and numerics, featurize the categoricals using one-hot encoding. Use the tabular explainer to get an explanation object, and then get raw feature importances" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load the Titanic dataset. Impute missing values by filling both backward and forward since some data is at the first/last row. This is just for illustration and not a recommended way to impute missing data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "titanic_url = ('https://raw.githubusercontent.com/amueller/'\n", + " 'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')\n", + "data = pd.read_csv(titanic_url)\n", + "# fill missing values\n", + "data = data.fillna(method=\"ffill\")\n", + "data = data.fillna(method=\"bfill\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similar to the example [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py), use a subset of columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "numeric_features = ['age', 'fare']\n", + "categorical_features = ['embarked', 'sex', 'pclass']\n", + "\n", + "y = data['survived'].values\n", + "X = data[categorical_features + numeric_features]\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One hot encode the categorical features" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import OneHotEncoder\n", + "one_enc = OneHotEncoder()\n", + "one_enc.fit(X_train[categorical_features])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Columnwise concatenate one hot encoded categoricals and numerical features." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "from scipy import sparse\n", + "def get_feats(X):\n", + " cats = one_enc.transform(X[categorical_features])\n", + " nums = X[numeric_features].values\n", + " return sparse.hstack((cats, nums))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Train a logistic regression model on featurized training data."
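Before fitting the model in the next cell, it can help to sanity-check the featurized width: it should equal the total number of one-hot columns plus the numeric columns, which is the same column order the raw-feature mapping built later in this notebook relies on. A small illustrative check (not part of the original notebook):

```
# Illustrative: featurized width = sum of one-hot columns + numeric columns.
n_onehot = sum(len(cats) for cats in one_enc.categories_)
X_train_check = get_feats(X_train)
assert X_train_check.shape[1] == n_onehot + len(numeric_features)
print('featurized width:', X_train_check.shape[1])
```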
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "X_train_transformed = get_feats(X_train)\n", + "X_test_transformed = get_feats(X_test)\n", + "\n", + "clf = LogisticRegression(solver='lbfgs', max_iter=200)\n", + "clf.fit(X_train_transformed, y_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Get feature mapping between raw and generated features. Using the order in which features are concatenated in `get_feats` and using `categories_` in `OneHotEncoder` we are able to compute this mapping." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "raw_feat_mapping = []\n", + "start_index = 0\n", + "for cat_list in one_enc.categories_:\n", + " raw_feat_mapping.append([start_index + i for i in range(len(cat_list))])\n", + " start_index += len(cat_list)\n", + "for i in range(len(numeric_features)):\n", + " raw_feat_mapping.append([start_index])\n", + " start_index += 1 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.explain.model.tabular_explainer import TabularExplainer\n", + "\n", + "explainer = TabularExplainer(clf, X_train_transformed)\n", + "global_explanation = explainer.explain_global(X_test_transformed)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "raw_feat_imps = global_explanation.get_raw_feature_importances(raw_feat_mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "feature_names = categorical_features + numeric_features\n", + "sorted_indices = np.argsort(raw_feat_imps)[::-1]\n", + "\n", + "for i in sorted_indices:\n", + " print(\"{}: {}\".format(feature_names[i], raw_feat_imps[i]))" + ] + } + ], + "metadata": { + "authors": [ + { + "name": "hichando" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/explain-model/explain-tabular-data/explain-tabular-data.ipynb b/how-to-use-azureml/explain-model/explain-tabular-data/explain-tabular-data.ipynb new file mode 100644 index 000000000..22075b3f4 --- /dev/null +++ b/how-to-use-azureml/explain-model/explain-tabular-data/explain-tabular-data.ipynb @@ -0,0 +1,267 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Uncomment these if explanation packages are not already installed in your environment\n", + "#!pip install --upgrade azureml-sdk[explain]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Explain a model with the AML explain-model package\n", + "\n", + "1. Train a SVM model using Scikit-learn\n", + "2. Run 'explain_model' in local mode, which doesn't contact any Azure services\n", + "3. 
Run 'explain_model' with AML Run History, which leverages Run History Service to store and manage the explanation data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Disclaimer: this notebook is a preview of model explainability, and the APIs shown below are subject to breaking changes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train a SVM model, which we will try to explain" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import Iris dataset\n", + "from sklearn import datasets\n", + "iris = datasets.load_iris()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data into train and test\n", + "from sklearn.model_selection import train_test_split\n", + "x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import scikit learn, fit a SVM model\n", + "def create_scikit_learn_model(X, y):\n", + " from sklearn import svm\n", + " clf = svm.SVC(gamma=0.001, C=100., probability=True)\n", + " model = clf.fit(X, y)\n", + " return model\n", + "model = create_scikit_learn_model(x_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run model explainer locally" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.explain.model.tabular_explainer import TabularExplainer" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "start = time.time()\n", + "\n", + "explainer = TabularExplainer(model, x_train, features=iris.feature_names)\n", + "global_explanation = explainer.explain_global(x_test)\n", + "\n", + "# importance values for each class, test example, and feature (local importance)\n", + "local_imp_values = global_explanation.local_importance_values\n", + "# base prediction with feature importances ignored\n", + "expected_values = global_explanation.expected_values\n", + "# global feature importance information\n", + "global_imp_values = global_explanation.global_importance_values\n", + "ranked_global_imp_names = global_explanation.get_ranked_global_names()\n", + "# global per-class feature importance information\n", + "per_class_imp_values = global_explanation.per_class_values\n", + "ranked_per_class_imp_names = global_explanation.get_ranked_per_class_names()\n", + "\n", + "end = time.time()\n", + "print(end - start)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run model explainer with AML Run History" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import azureml.core\n", + "from azureml.core import Workspace, Experiment, Run\n", + "from azureml.explain.model.tabular_explainer import TabularExplainer\n", + "from azureml.contrib.explain.model.explanation.explanation_client import ExplanationClient\n", + "# Check core SDK version number\n", + "print(\"SDK version:\", azureml.core.VERSION)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ws = Workspace.from_config()\n", + "print('Workspace name: ' + ws.name, \n", + " 'Azure region: ' + ws.location, \n", + " 'Subscription 
id: ' + ws.subscription_id, \n", + " 'Resource group: ' + ws.resource_group, sep = '\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "experiment_name = 'explain_model'\n", + "experiment = Experiment(ws, experiment_name)\n", + "run = experiment.start_logging()\n", + "client = ExplanationClient.from_run(run)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "start = time.time()\n", + "explainer = TabularExplainer(model, x_train, features=iris.feature_names, classes=iris.target_names)\n", + "explanation = explainer.explain_global(x_test)\n", + "client.upload_model_explanation(explanation)\n", + "end = time.time()\n", + "print(end - start)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "explanation_from_run = client.download_model_explanation()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# global feature importance information\n", + "global_imp_values = explanation_from_run.global_importance_values\n", + "global_imp_names = explanation_from_run.get_ranked_global_names()\n", + "# global per-class feature importance information\n", + "per_class_imp_values = explanation_from_run.per_class_values\n", + "per_class_imp_names = explanation_from_run.get_ranked_per_class_names()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## This visualization is unsupported, and is not guaranteed to work in the future" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get the shap values and explore locally\n", + "import shap\n", + "import numpy as np\n", + "shap.initjs()\n", + "display(shap.force_plot(explanation_from_run.expected_values[1], np.asarray(explanation_from_run.local_importance_values[1]), x_test))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run.complete()" + ] + } + ], + "metadata": { + "authors": [ + { + "name": "wamartin" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/how-to-use-azureml/machine-learning-pipelines/README.md b/how-to-use-azureml/machine-learning-pipelines/README.md index 0d5f6dcea..093bc73e0 100644 --- a/how-to-use-azureml/machine-learning-pipelines/README.md +++ b/how-to-use-azureml/machine-learning-pipelines/README.md @@ -47,6 +47,7 @@ In this directory, there are two types of notebooks: 7. [aml-pipelines-parameter-tuning-with-hyperdrive.ipynb](https://aka.ms/pl-hyperdrive) 8. [aml-pipelines-how-to-use-azurebatch-to-run-a-windows-executable.ipynb](https://aka.ms/pl-azbatch) 9. [aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb](https://aka.ms/pl-schedule) +10. [aml-pipelines-with-automated-machine-learning-step.ipynb](https://aka.ms/pl-automl) * The second type of notebooks illustrate more sophisticated scenarios, and are independent of each other. 
These notebooks include: diff --git a/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb b/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb index f3ea27313..77187ebfa 100644 --- a/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb +++ b/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb @@ -326,7 +326,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Reactivate the schedule" + "### Reenable the schedule" ] }, { @@ -337,9 +337,9 @@ "source": [ "# Set the wait_for_provisioning flag to False if you do not want to wait \n", "# for the call to provision the schedule in the backend.\n", - "fetched_schedule.activate(wait_for_provisioning=True)\n", + "fetched_schedule.enable(wait_for_provisioning=True)\n", "fetched_schedule = Schedule.get(ws, schedule_id)\n", - "print(\"Activated schedule {}. New status is: {}\".format(fetched_schedule.id, fetched_schedule.status))" + "print(\"Enabled schedule {}. New status is: {}\".format(fetched_schedule.id, fetched_schedule.status))" ] }, { diff --git a/how-to-use-azureml/machine-learning-pipelines/pipeline-batch-scoring/pipeline-batch-scoring.ipynb b/how-to-use-azureml/machine-learning-pipelines/pipeline-batch-scoring/pipeline-batch-scoring.ipynb index 9108eb247..f544e3fad 100644 --- a/how-to-use-azureml/machine-learning-pipelines/pipeline-batch-scoring/pipeline-batch-scoring.ipynb +++ b/how-to-use-azureml/machine-learning-pipelines/pipeline-batch-scoring/pipeline-batch-scoring.ipynb @@ -569,7 +569,7 @@ "metadata": { "authors": [ { - "name": "hichando" + "name": "sanpil" } ], "kernelspec": { diff --git a/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb b/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb index 28ef50684..2ab8ae6a4 100644 --- a/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb +++ b/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb @@ -619,7 +619,7 @@ "metadata": { "authors": [ { - "name": "hichando" + "name": "sanpil" } ], "kernelspec": { diff --git a/how-to-use-azureml/training/logging-api/img/run_details.PNG b/how-to-use-azureml/training/logging-api/img/run_details.PNG new file mode 100644 index 000000000..9bfe60fd1 Binary files /dev/null and b/how-to-use-azureml/training/logging-api/img/run_details.PNG differ diff --git a/how-to-use-azureml/training/logging-api/img/run_history.png b/how-to-use-azureml/training/logging-api/img/run_history.png new file mode 100644 index 000000000..3dd32de61 Binary files /dev/null and b/how-to-use-azureml/training/logging-api/img/run_history.png differ diff --git a/how-to-use-azureml/training/logging-api/logging-api.ipynb b/how-to-use-azureml/training/logging-api/logging-api.ipynb index 919393127..690779910 100644 --- a/how-to-use-azureml/training/logging-api/logging-api.ipynb +++ b/how-to-use-azureml/training/logging-api/logging-api.ipynb @@ -1,328 +1,530 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Copyright (c) Microsoft Corporation. All rights reserved.\n", - "\n", - "Licensed under the MIT License." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 06. Logging APIs\n", - "This notebook showcase various ways to use the Azure Machine Learning service run logging APIs, and view the results in the Azure portal." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "Make sure you go through the [configuration notebook](../../../configuration.ipynb) first if you haven't. Also make sure you have tqdm and matplotlib installed in the current kernel.\n", - "\n", - "```\n", - "(myenv) $ conda install -y tqdm matplotlib\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Validate Azure ML SDK installation and get version number for debugging purposes" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "install" - ] - }, - "outputs": [], - "source": [ - "from azureml.core import Experiment, Workspace\n", - "import azureml.core\n", - "import numpy as np\n", - "\n", - "# Check core SDK version number\n", - "print(\"SDK version:\", azureml.core.VERSION)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initialize Workspace\n", - "\n", - "Initialize a workspace object from persisted configuration." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "create workspace" - ] - }, - "outputs": [], - "source": [ - "ws = Workspace.from_config()\n", - "print('Workspace name: ' + ws.name, \n", - " 'Azure region: ' + ws.location, \n", - " 'Subscription id: ' + ws.subscription_id, \n", - " 'Resource group: ' + ws.resource_group, sep='\\n')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Set experiment\n", - "Create a new experiment (or get the one with such name)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "exp = Experiment(workspace=ws, name='logging-api-test')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Log metrics\n", - "We will start a run, and use the various logging APIs to record different types of metrics during the run." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from tqdm import tqdm\n", - "\n", - "# start logging for the run\n", - "run = exp.start_logging()\n", - "\n", - "# log a string value\n", - "run.log(name='Name', value='Logging API run')\n", - "\n", - "# log a numerical value\n", - "run.log(name='Magic Number', value=42)\n", - "\n", - "# Log a list of values. Note this will generate a single-variable line chart.\n", - "run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89])\n", - "\n", - "# create a dictionary to hold a table of values\n", - "sines = {}\n", - "sines['angle'] = []\n", - "sines['sine'] = []\n", - "\n", - "for i in tqdm(range(-10, 10)):\n", - " # log a metric value repeatedly, this will generate a single-variable line chart.\n", - " run.log(name='Sigmoid', value=1 / (1 + np.exp(-i)))\n", - " angle = i / 2.0\n", - " \n", - " # log a 2 (or more) values as a metric repeatedly. 
This will generate a 2-variable line chart if you have 2 numerical columns.\n", - " run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle))\n", - " \n", - " sines['angle'].append(angle)\n", - " sines['sine'].append(np.sin(angle))\n", - "\n", - "# log a dictionary as a table, this will generate a 2-variable chart if you have 2 numerical columns\n", - "run.log_table(name='Sine Wave', value=sines)\n", - "\n", - "run.complete()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Even after the run is marked completed, you can still log things." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Log an image\n", - "This is how to log a _matplotlib_ pyplot object." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%matplotlib inline\n", - "import matplotlib.pyplot as plt\n", - "angle = np.linspace(-3, 3, 50)\n", - "plt.plot(angle, np.tanh(angle), label='tanh')\n", - "plt.legend(fontsize=12)\n", - "plt.title('Hyperbolic Tangent', fontsize=16)\n", - "plt.grid(True)\n", - "\n", - "run.log_image(name='Hyperbolic Tangent', plot=plt)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Upload a file" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can also upload an abitrary file. First, let's create a dummy file locally." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%writefile myfile.txt\n", - "\n", - "This is a dummy file." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's upload this file into the run record as a run artifact, and display the properties after the upload." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "props = run.upload_file(name='outputs/myfile_in_the_cloud.txt', path_or_stream='./myfile.txt')\n", - "props.serialize()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Examine the run" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's take a look at the run detail page in Azure portal. Make sure you checkout the various charts and plots generated/uploaded." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "run" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can get all the metrics in that run back." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "run.get_metrics()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can also see the files uploaded for this run." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "run.get_file_names()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can also download all the files locally." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "os.makedirs('files', exist_ok=True)\n", - "\n", - "for f in run.get_file_names():\n", - " dest = os.path.join('files', f.split('/')[-1])\n", - " print('Downloading file {} to {}...'.format(f, dest))\n", - " run.download_file(f, dest) " - ] - } + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Logging\n", + "\n", + "_**This notebook showcases various ways to use the Azure Machine Learning service run logging APIs, and view the results in the Azure portal.**_\n", + "\n", + "---\n", + "---\n", + "\n", + "## Table of Contents\n", + "\n", + "1. [Introduction](#Introduction)\n", + "1. [Setup](#Setup)\n", + " 1. Validate Azure ML SDK installation\n", + " 1. Initialize workspace\n", + " 1. Set experiment\n", + "1. [Logging](#Logging)\n", + " 1. Starting a run\n", + " 1. Viewing a run in the portal\n", + " 1. Viewing the experiment in the portal\n", + " 1. Logging metrics\n", + " 1. Logging string metrics\n", + " 1. Logging numeric metrics\n", + " 1. Logging vectors\n", + " 1. Logging tables\n", + " 1. Uploading files\n", + "1. [Analyzing results](#Analyzing-results)\n", + " 1. Tagging a run\n", + "1. [Next steps](#Next-steps)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "Logging metrics from runs in your experiments allows you to track results from one run to another, determine trends in your outputs, and understand how your inputs correspond to your model and script performance. The Azure Machine Learning service (AzureML) allows you to track various types of metrics, including images and arbitrary files, in order to understand, analyze, and audit your experimental progress. \n", + "\n", + "Typically you should log all parameters of your experiment and all of its numerical and string outputs. This will allow you to analyze the performance of your experiments across multiple runs, correlate inputs to outputs, and filter runs based on interesting criteria.\n", + "\n", + "The experiment's Run History report page automatically creates a report that can be customized to show the KPIs, charts, and column sets that are interesting to you. \n", + "\n", + "| ![Run Details](./img/run_details.PNG) | ![Run History](./img/run_history.png) |\n", + "|:--:|:--:|\n", + "| *Run Details* | *Run History* |\n", + "\n", + "---\n", + "\n", + "## Setup\n", + "\n", + "Make sure you go through the [configuration notebook](../../../configuration.ipynb) first if you haven't. 
Also make sure you have tqdm and matplotlib installed in the current kernel.\n", + "\n", + "```\n", + "(myenv) $ conda install -y tqdm matplotlib\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Validate Azure ML SDK installation and get version number for debugging purposes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "install" + ] + }, + "outputs": [], + "source": [ + "from azureml.core import Experiment, Workspace, Run\n", + "import azureml.core\n", + "import numpy as np\n", + "from tqdm import tqdm\n", + "\n", + "# Check core SDK version number\n", + "\n", + "print(\"This notebook was created using SDK version AZUREML-SDK-VERSION, you are currently running version\", azureml.core.VERSION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Initialize workspace\n", + "\n", + "Initialize a workspace object from persisted configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "create workspace" + ] + }, + "outputs": [], + "source": [ + "ws = Workspace.from_config()\n", + "print('Workspace name: ' + ws.name, \n", + " 'Azure region: ' + ws.location, \n", + " 'Subscription id: ' + ws.subscription_id, \n", + " 'Resource group: ' + ws.resource_group, sep='\\n')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set experiment\n", + "Create a new experiment (or get the one with the specified name). An *experiment* is a container for an arbitrary set of *runs*. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "experiment = Experiment(workspace=ws, name='logging-api-test')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Logging\n", + "In this section we will explore the various logging mechanisms.\n", + "\n", + "### Starting a run\n", + "\n", + "A *run* is a singular experimental trial. In this notebook we will create a run directly on the experiment by calling `run = experiment.start_logging()`. If you were experimenting by submitting a script file as an experiment using ``experiment.submit()``, you would call `run = Run.get_context()` in your script to access the run context of your code, as the sketch after the next cell illustrates. In either case, the logging methods on the returned run object work the same.\n", + "\n", + "This cell also stores the run id for use later in this notebook. The run_id is not necessary for logging." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# start logging for the run\n", + "run = experiment.start_logging()\n", + "\n", + "# access the run id for use later\n", + "run_id = run.id\n", + "\n", + "# change the scale factor on different runs to see how you can compare multiple runs\n", + "scale_factor = 2\n", + "\n", + "# change the category on different runs to see how to organize data in reports\n", + "category = 'Red'" + ] + },
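 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an aside, here is a minimal sketch of the submitted-script variant mentioned above. Inside a script submitted with ``experiment.submit()``, ``Run.get_context()`` returns the active run; executed interactively here, it should instead return an offline placeholder run, so the cell is safe to try." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sketch: this is how a submitted script obtains its run context.\n", + "# Outside a submitted run, get_context() falls back to an offline run object.\n", + "script_run = Run.get_context()\n", + "print(type(script_run).__name__)" + ] + },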
 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Viewing a run in the portal\n", + "Once a run is started you can see the run in the portal by simply typing ``run``. Clicking on the \"Link to Portal\" link will take you to the Run Details page that shows the metrics you have logged and other run properties. You can refresh this page after each logging statement to see the updated results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Viewing an experiment in the portal\n", + "You can also view an experiment similarly by typing `experiment`. The portal link will take you to the experiment's Run History page that shows all runs and allows you to analyze trends across multiple runs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "experiment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Logging metrics\n", + "Metrics are visible in the run details page in the AzureML portal and also can be analyzed in experiment reports. The run details page looks as below and contains tabs for Details, Outputs, Logs, and Snapshot. \n", + "* The Details page displays attributes about the run, plus logged metrics and images. Metrics that are vectors appear as charts. \n", + "* The Outputs page contains any files, such as models, that you uploaded from your run into the \"outputs\" directory in storage. If you place files in the \"outputs\" directory locally, the files are automatically uploaded on your behalf when the run is completed.\n", + "* The Logs page allows you to view any log files created by your run. Logging runs created in notebooks typically do not generate log files.\n", + "* The Snapshot page contains a snapshot of the directory specified in the ``start_logging`` statement, plus the notebook at the time of the ``start_logging`` call. This snapshot and notebook can be downloaded from the Run Details page to continue or reproduce an experiment.\n", + "\n", + "### Logging string metrics\n", + "The following cell logs a string metric. A string metric is simply a string value associated with a name. String metrics are useful for labelling runs and organizing your data. Typically you should log all string parameters as metrics for later analysis - even information such as paths can help you understand how individual experiments perform differently.\n", + "\n", + "String metrics can be used in the following ways:\n", + "* Plotting in histograms\n", + "* Grouping by indicators in numerical plots\n", + "* Filtering runs\n", + "\n", + "String metrics appear in the **Tracked Metrics** section of the Run Details page and can be added as a column in Run History reports." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# log a string metric\n", + "run.log(name='Category', value=category)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Logging numerical metrics\n", + "The following cell logs some numerical metrics. Numerical metrics can include metrics such as AUC or MSE. You should log any parameter or significant output measure in order to understand trends across multiple experiments. Numerical metrics appear in the **Tracked Metrics** section of the Run Details page, and can be used in charts or KPIs in experiment Run History reports."
 + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# log numerical values\n", + "run.log(name=\"scale factor\", value=scale_factor)\n", + "run.log(name='Magic Number', value=42 * scale_factor)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Logging vectors\n", + "Vectors are good for recording information such as loss curves. You can log a vector by creating a list of numbers and calling ``log_list()`` with a name and the list, or by repeatedly logging a value using the same name.\n", + "\n", + "Vectors are presented in Run Details as a chart, and are directly comparable in experiment reports when placed in a chart. **Note:** vectors logged into the run are expected to be relatively small. Logging very large vectors into Azure ML can result in reduced performance. If you need to store large amounts of data associated with the run, you can write the data to a file that will be uploaded." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fibonacci_values = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]\n", + "# use a list comprehension (not a generator) so log_list receives concrete values\n", + "scaled_values = [i * scale_factor for i in fibonacci_values]\n", + "\n", + "# Log a list of values. Note this will generate a single-variable line chart.\n", + "run.log_list(name='Fibonacci', value=scaled_values)\n", + "\n", + "for i in tqdm(range(-10, 10)):\n", + " # log a metric value repeatedly, this will generate a single-variable line chart.\n", + " run.log(name='Sigmoid', value=1 / (1 + np.exp(-i)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Logging tables\n", + "Tables are good for recording related sets of information such as accuracy tables, confusion matrices, etc. \n", + "You can log a table in two ways:\n", + "* Create a dictionary of lists where each list represents a column in the table and call ``log_table()``\n", + "* Repeatedly call ``log_row()`` providing the same table name with a consistent set of named args as the column values\n", + "\n", + "Tables are presented in Run Details as a chart using the first two columns of the table. **Note:** tables logged into the run are expected to be relatively small. Logging very large tables into Azure ML can result in reduced performance. If you need to store large amounts of data associated with the run, you can write the data to a file that will be uploaded." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a dictionary to hold a table of values\n", + "sines = {}\n", + "sines['angle'] = []\n", + "sines['sine'] = []\n", + "\n", + "for i in tqdm(range(-10, 10)):\n", + " angle = i / 2.0 * scale_factor\n", + " \n", + " # log 2 (or more) values as a metric repeatedly. This will generate a 2-variable line chart if you have 2 numerical columns.\n", + " run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle))\n", + " \n", + " sines['angle'].append(angle)\n", + " sines['sine'].append(np.sin(angle))\n", + "\n", + "# log a dictionary as a table, this will generate a 2-variable chart if you have 2 numerical columns\n", + "run.log_table(name='Sine Wave', value=sines)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Logging images\n", + "You can directly log _matplotlib_ plots and arbitrary images to your run record. This code logs a _matplotlib_ pyplot object. Images show up in the run details page in the Azure ML Portal."
 + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "# Create a plot\n", + "import matplotlib.pyplot as plt\n", + "angle = np.linspace(-3, 3, 50) * scale_factor\n", + "plt.plot(angle, np.tanh(angle), label='tanh')\n", + "plt.legend(fontsize=12)\n", + "plt.title('Hyperbolic Tangent', fontsize=16)\n", + "plt.grid(True)\n", + "\n", + "# Log the plot to the run. To log an arbitrary image, use the form run.log_image(name, path='./image_path.png')\n", + "run.log_image(name='Hyperbolic Tangent', plot=plt)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Uploading files\n", + "\n", + "Any files that are placed in the ``./outputs`` directory are automatically uploaded when the run is completed. These files are also visible in the *Outputs* tab of the Run Details page. Files can also be uploaded explicitly and stored as artifacts along with the run record, as the sketch after the next cell shows. Create the ``./outputs`` directory first if it does not already exist.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile ./outputs/myfile.txt\n", + "\n", + "This is an output file that will be automatically uploaded." + ] + },
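 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a minimal sketch of the explicit upload mentioned above, the next cell pushes the file just written into the run record with ``run.upload_file()``. The artifact name ``artifacts/myfile_copy.txt`` is an arbitrary illustrative choice; a name outside ``outputs/`` is used so it does not collide with the automatic upload of the ``outputs`` directory when the run completes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Explicitly upload the file written above as a run artifact.\n", + "# The artifact name 'artifacts/myfile_copy.txt' is an illustrative choice.\n", + "props = run.upload_file(name='artifacts/myfile_copy.txt', path_or_stream='./outputs/myfile.txt')\n", + "props.serialize()" + ] + },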
 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Completing the run\n", + "\n", + "Calling `run.complete()` marks the run as completed and triggers the output file collection. If for any reason you need to indicate the run failed, or simply need to cancel the run, you can call `run.fail()` or `run.cancel()`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run.complete()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Analyzing results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can refresh the run in the Azure portal to see all of your results. In many cases you will want to analyze runs that were performed previously to inspect the contents or compare results. Runs can be fetched from their parent Experiment object using the ``Run()`` constructor or the ``experiment.get_runs()`` method. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fetched_run = Run(experiment, run_id)\n", + "fetched_run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Call ``run.get_metrics()`` to retrieve all the metrics from a run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fetched_run.get_metrics()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "See the files uploaded for this run by calling ``run.get_file_names()``." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fetched_run.get_file_names()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you know the file names in a run, you can download the files using the ``run.download_file()`` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.makedirs('files', exist_ok=True)\n", + "\n", + "for f in fetched_run.get_file_names():\n", + " dest = os.path.join('files', f.split('/')[-1])\n", + " print('Downloading file {} to {}...'.format(f, dest))\n", + " fetched_run.download_file(f, dest) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Tagging a run\n", + "Often when you analyze the results of a run, you may need to tag that run with important personal or external information. You can add a tag to a run using the ``run.tag()`` method. AzureML supports valueless and valued tags." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fetched_run.tag(\"My Favorite Run\")\n", + "fetched_run.tag(\"Competition Rank\", 1)\n", + "\n", + "fetched_run.get_tags()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next steps\n", + "To experiment more with logging and to understand how metrics can be visualized, go back to the *Starting a run* section, try changing the category and scale_factor values, and go through the notebook several times. Play with the KPI, charting, and column selection options on the experiment's Run History reports page to see how the various metrics can be combined and visualized.\n", + "\n", + "After learning about all of the logging options, go to the [train on remote vm](../train-on-remote-vm/train-on-remote-vm.ipynb) notebook and experiment with logging from remote compute contexts." + ] + } + ], + "metadata": { + "authors": [ + { + "name": "roastala" + } ], - "metadata": { - "authors": [ - { - "name": "haining" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/how-to-use-azureml/training/manage-runs/hello.py b/how-to-use-azureml/training/manage-runs/hello.py new file mode 100644 index 000000000..69b754326 --- /dev/null +++ b/how-to-use-azureml/training/manage-runs/hello.py @@ -0,0 +1,9 @@ +# Copyright (c) Microsoft. All rights reserved. +# Licensed under the MIT license. 
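+# Minimal logging script: Run.get_context() returns the run this script was +# submitted under, so metrics logged here land in that run's record.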
+ +from azureml.core import Run + +submitted_run = Run.get_context() +submitted_run.log(name="message", value="Hello from run!") diff --git a/how-to-use-azureml/training/manage-runs/hello_with_children.py b/how-to-use-azureml/training/manage-runs/hello_with_children.py new file mode 100644 index 000000000..953d88fe9 --- /dev/null +++ b/how-to-use-azureml/training/manage-runs/hello_with_children.py @@ -0,0 +1,11 @@ +# Copyright (c) Microsoft. All rights reserved. +# Licensed under the MIT license. + +from azureml.core import Run + +run = Run.get_context() + +child_runs = run.create_children(count=5) +for c, child in enumerate(child_runs): + child.log(name="Hello from child run", value=c) + child.complete() diff --git a/how-to-use-azureml/training/manage-runs/hello_with_delay.py b/how-to-use-azureml/training/manage-runs/hello_with_delay.py new file mode 100644 index 000000000..aea75402e --- /dev/null +++ b/how-to-use-azureml/training/manage-runs/hello_with_delay.py @@ -0,0 +1,8 @@ +# Copyright (c) Microsoft. All rights reserved. +# Licensed under the MIT license. + +import time + +print("Wait for 10 seconds...") +time.sleep(10) +print("Done waiting") diff --git a/how-to-use-azureml/training/manage-runs/manage-runs.ipynb b/how-to-use-azureml/training/manage-runs/manage-runs.ipynb new file mode 100644 index 000000000..f183daf62 --- /dev/null +++ b/how-to-use-azureml/training/manage-runs/manage-runs.ipynb @@ -0,0 +1,595 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Manage runs\n", + "\n", + "## Table of contents\n", + "\n", + "1. [Introduction](#Introduction)\n", + "1. [Setup](#Setup)\n", + "1. [Start, monitor and complete a run](#Start,-monitor-and-complete-a-run)\n", + "1. [Add properties and tags](#Add-properties-and-tags)\n", + "1. [Query properties and tags](#Query-properties-and-tags)\n", + "1. [Start and query child runs](#Start-and-query-child-runs)\n", + "1. [Cancel or fail runs](#Cancel-or-fail-runs)\n", + "1. [Reproduce a run](#Reproduce-a-run)\n", + "1. [Next steps](#Next-steps)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "When you're building enterprise-grade machine learning models, it is important to track, organize, monitor and reproduce your training runs. For example, you might want to trace the lineage behind a model deployed to production, and re-run the training experiment to troubleshoot issues. \n", + "\n", + "This notebook shows examples of how to use Azure Machine Learning services to manage your training runs." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Make sure you go through the [configuration notebook](../../../configuration.ipynb) first if you haven't. Also, if you're new to Azure ML, we recommend that you go through [the tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml) first to learn the basic concepts.\n", + "\n", + "Let's first import required packages, check the Azure ML SDK version, connect to your workspace and create an Experiment to hold the runs."
 + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import azureml.core\n", + "from azureml.core import Workspace, Experiment, Run\n", + "from azureml.core import ScriptRunConfig\n", + "\n", + "print(azureml.core.VERSION)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ws = Workspace.from_config()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exp = Experiment(workspace=ws, name=\"explore-runs\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Start, monitor and complete a run\n", + "\n", + "A run is a unit of execution, typically used to train a model, but also for other purposes, such as loading or transforming data. Runs are tracked by the Azure ML service, and can be instrumented with metrics and artifact logging.\n", + "\n", + "The simplest way to start a run in your interactive Python session is to call the *Experiment.start_logging* method. You can then log metrics from within the run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_run = exp.start_logging()\n", + "\n", + "notebook_run.log(name=\"message\", value=\"Hello from run!\")\n", + "\n", + "print(notebook_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the *get_status* method to get the status of the run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(notebook_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Also, you can simply enter the run to get a link to the Azure Portal details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The *get_details* method gives you more details on the run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_run.get_details()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the *complete* method to end the run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "notebook_run.complete()\n", + "print(notebook_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also use Python's *with...as* pattern. The run will automatically complete when moving out of scope. This way you don't need to manually complete the run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with exp.start_logging() as notebook_run:\n", + " notebook_run.log(name=\"message\", value=\"Hello from run!\")\n", + " print(\"Is it still running?\", notebook_run.get_status())\n", + " \n", + "print(\"Has it completed?\", notebook_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's look at submitting a run as a separate Python process. To keep the example simple, we submit the run on the local computer. Other targets could include remote VMs and Machine Learning Compute clusters in your Azure ML Workspace.\n", + "\n", + "We use the *hello.py* script as an example. 
To perform logging, we need to get a reference to the Run instance from within the scope of the script. We do this using the *Run.get_context* method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!more hello.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's submit the run on the local computer. A standard pattern in the Azure ML SDK is to create a run configuration, and then use the *Experiment.submit* method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_config = ScriptRunConfig(source_directory='.', script='hello.py')\n", + "\n", + "local_script_run = exp.submit(run_config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can view the status of the run as before." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(local_script_run.get_status())\n", + "local_script_run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Submitted runs have additional log files you can inspect using *get_details_with_logs*." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run.get_details_with_logs()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the *wait_for_completion* method to block the local execution until the remote run is complete." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run.wait_for_completion(show_output=True)\n", + "print(local_script_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Add properties and tags\n", + "\n", + "Properties and tags help you organize your runs. You can use them to describe, for example, who authored the run, what the results were, and what machine learning approach was used. And as you'll later learn, properties and tags can be used to query the history of your runs to find the important ones.\n", + "\n", + "For example, let's add an \"author\" property to the run:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run.add_properties({\"author\":\"azureml-user\"})\n", + "print(local_script_run.get_properties())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Properties are immutable. Once you assign a value, it cannot be changed, making them useful as a permanent record for auditing purposes."
 + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " local_script_run.add_properties({\"author\":\"different-user\"})\n", + "except Exception as e:\n", + " print(e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Tags, on the other hand, can be changed:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run.tag(\"quality\", \"great run\")\n", + "print(local_script_run.get_tags())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run.tag(\"quality\", \"fantastic run\")\n", + "print(local_script_run.get_tags())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also add a simple string tag. It appears in the tag dictionary with a value of None." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run.tag(\"worth another look\")\n", + "print(local_script_run.get_tags())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Query properties and tags\n", + "\n", + "You can query runs within an experiment that match specific properties and tags. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list(exp.get_runs(properties={\"author\":\"azureml-user\"},tags={\"quality\":\"fantastic run\"}))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list(exp.get_runs(properties={\"author\":\"azureml-user\"},tags=\"worth another look\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Start and query child runs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use child runs to group together related runs, for example different hyperparameter tuning iterations.\n", + "\n", + "Let's use the *hello_with_children.py* script to create a batch of 5 child runs from within a submitted run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!more hello_with_children.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_config = ScriptRunConfig(source_directory='.', script='hello_with_children.py')\n", + "\n", + "local_script_run = exp.submit(run_config)\n", + "local_script_run.wait_for_completion(show_output=True)\n", + "print(local_script_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can start child runs one by one. Note that this is less efficient than submitting a batch of runs, because each creation results in a network call.\n", + "\n", + "Child runs, too, complete automatically as they move out of scope." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with exp.start_logging() as parent_run:\n", + " for c,count in enumerate(range(5)):\n", + " with parent_run.child_run() as child:\n", + " child.log(name=\"Hello from child run\", value=c)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To query the child runs belonging to a specific parent, use the *get_children* method."
 + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list(parent_run.get_children())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cancel or fail runs\n", + "\n", + "Sometimes, you realize that the run is not performing as intended, and you want to cancel it instead of waiting for it to complete.\n", + "\n", + "As an example, let's create a Python script with a delay in the middle." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!more hello_with_delay.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use the *cancel* method to cancel a run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_config = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')\n", + "\n", + "local_script_run = exp.submit(run_config)\n", + "print(\"Did the run start?\",local_script_run.get_status())\n", + "local_script_run.cancel()\n", + "print(\"Did the run cancel?\",local_script_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also mark an unsuccessful run as failed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run = exp.submit(run_config)\n", + "local_script_run.fail()\n", + "print(local_script_run.get_status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Reproduce a run\n", + "\n", + "When updating or troubleshooting a model deployed to production, you sometimes need to revisit the original training run that produced the model. To help you with this, the Azure ML service by default creates snapshots of your scripts at the time of run submission.\n", + "\n", + "You can use *restore_snapshot* to obtain a zip package of the latest snapshot of the script folder. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_script_run.restore_snapshot(path=\"snapshots\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can then extract the zip package, examine the code, and submit your run again." + ] + },
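 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For instance, here is a minimal sketch using only the Python standard library (assuming the snapshot zip was saved under the ``snapshots`` folder used above) that lists the files captured in each snapshot:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# List the contents of the downloaded snapshot zip(s).\n", + "# 'snapshots' matches the path passed to restore_snapshot above.\n", + "import glob\n", + "import zipfile\n", + "\n", + "for zip_path in glob.glob('snapshots/*.zip'):\n", + " with zipfile.ZipFile(zip_path) as snapshot_zip:\n", + " print(zip_path)\n", + " print(snapshot_zip.namelist())" + ] + },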
 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next steps\n", + "\n", + " * To learn more about logging APIs, see the [logging API notebook](../logging-api/logging-api.ipynb)\n", + " * To learn more about remote runs, see the [train on AML compute notebook](../train-on-amlcompute/train-on-amlcompute.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "authors": [ + { + "name": "roastala" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/tutorials/img-classification-part1-training.ipynb b/tutorials/img-classification-part1-training.ipynb index 2d5a663d7..4171c4925 100644 --- a/tutorials/img-classification-part1-training.ipynb +++ b/tutorials/img-classification-part1-training.ipynb @@ -23,8 +23,7 @@ "\n", "> * Set up your development environment\n", "> * Access and examine the data\n", - "> * Train a simple logistic regression model locally using the popular scikit-learn machine learning library \n", - "> * Train multiple models on a remote cluster\n", + "> * Train a simple logistic regression model on a remote cluster\n", "> * Review training results, find and register the best model\n", "\n", "You'll learn how to select a model and deploy it in [part two of this tutorial](deploy-models.ipynb) later. \n",