Commit 4587181
E2E Census code updated, modin, scikit-learn extension and readme fixed (oneapi-src#641)
* updates to README
* Update README.md
* update to Jupyter notebook
* Census readme updated
* Census readme updated
* sample.json updated for Census
* updates to sample.json
* typo fixed
* updates to sample.json
1 parent c73e1df commit 4587181

3 files changed (+68, -169 lines)
AI-and-Analytics/End-to-end-Workloads/Census/README.md

Lines changed: 37 additions & 55 deletions
@@ -1,26 +1,26 @@
# End-to-end Machine Learning Workload: `Census` Sample

-This sample code illustrates how to use Intel® Distribution of Modin for ETL operations and ridge regression algorithm from the Intel® oneAPI Data Analytics Library (oneDAL) accelerated scikit-learn library to build and run an end to end machine learning workload. Both Intel® Distribution of Modin and oneDAL accelerated scikit-learn libraries are available together in [Intel® oneAPI AI Analytics Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). This sample code demonstrates how to seamlessly run the end-to-end census workload using the toolkit, without any external dependencies.
+This sample code illustrates how to use Intel® Distribution of Modin for ETL operations and the ridge regression algorithm from the Intel® Extension for Scikit-learn library to build and run an end-to-end machine learning workload. Both Intel® Distribution of Modin and Intel® Extension for Scikit-learn are available together in the [Intel® oneAPI AI Analytics Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). This sample code demonstrates how to seamlessly run the end-to-end census workload using the toolkit, without any external dependencies.

| Optimized for | Description
| :--- | :---
| OS | 64-bit Linux: Ubuntu 18.04 or higher
| Hardware | Intel Atom® Processors; Intel® Core™ Processor Family; Intel® Xeon® Processor Family; Intel® Xeon® Scalable Performance Processor Family
-| Software | Intel® AI Analytics Toolkit (Python version 3.7, Intel® Distribution of Modin , Ray, Intel® oneAPI Data Analytics Library (oneDAL), Scikit-Learn, NumPy)
-| What you will learn | How to use Intel® Distribution of Modin and oneDAL optimized scikit-learn (developed and owned by Intel) to build end to end ML workloads and gain performance.
+| Software | Intel® AI Analytics Toolkit (Python version 3.7, Intel® Distribution of Modin, Ray, Intel® Extension for Scikit-learn, NumPy)
+| What you will learn | How to use Intel® Distribution of Modin and Intel® Extension for Scikit-learn to build end-to-end ML workloads and gain performance.
| Time to complete | 15-18 minutes

## Purpose
-Intel® Distribution of Modin uses Ray to provide an effortless way to speed up your Pandas notebooks, scripts and libraries. Unlike other distributed DataFrame libraries, Intel® Distribution of Modin provides seamless integration and compatibility with existing Pandas code. Daal4py is a simplified API to Intel oneDAL that allows for fast usage of the framework suited for Data Scientists and Machine Learning users. It is built to help provide an abstraction to Intel® oneDAL for either direct usage or integration into one's own framework.
+Intel® Distribution of Modin uses Ray to provide an effortless way to speed up your Pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Intel® Distribution of Modin provides seamless integration and compatibility with existing Pandas code. Intel® Extension for Scikit-learn dynamically patches scikit-learn estimators to use the Intel® oneAPI Data Analytics Library (oneDAL) as the underlying solver, while getting the same solution faster.

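The Purpose paragraph above combines two ideas: Modin as a drop-in pandas replacement running on Ray, and Intel® Extension for Scikit-learn patching stock scikit-learn estimators. The following minimal sketch shows how the two typically fit together; the environment variable, imports, and `patch_sklearn()` call mirror the notebook cells changed later in this commit, while the tiny DataFrame and the `Ridge` fit are illustrative only (the real sample uses the IPUMS census dataset).

```
import os

# Select Ray as Modin's compute engine before importing modin.pandas
# (the notebook in this commit additionally selects the OmniSci backend).
os.environ["MODIN_ENGINE"] = "ray"
import modin.pandas as pd

# Patch scikit-learn so that supported estimators run on oneDAL.
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.linear_model import Ridge  # now resolves to the accelerated implementation

# Illustrative toy data only; the real workload ingests the census CSV.
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0], "target": [1.1, 1.9, 3.2, 3.9]})
model = Ridge().fit(df[["feature"]].to_numpy(), df["target"].to_numpy())
print(model.predict([[5.0]]))
```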
#### Model and dataset
In this sample, you will use Intel® Distribution of Modin to ingest and process U.S. census data from 1970 to 2010 in order to build a ridge regression based model to find the relation between education and the total income earned in the US.
-Data transformation stage normalizes the income to the yearly inflation, balances the data such that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the income of each sample.
+The data transformation stage normalizes the income to yearly inflation, balances the data such that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the education of each sample.

Dataset is from IPUMS USA, University of Minnesota, [www.ipums.org](https://ipums.org/) (Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0)

## Key Implementation Details
-This end-to-end workload sample code is implemented for CPU using the Python language. With the installation of Intel AI Analytics Toolkit, the conda environment is prepared with Python version 3.7, Intel® Distribution of Modin , Ray, Intel® oneAPI Data Analytics Library (oneDAL), Scikit-Learn, NumPy following which the sample code can be directly run using the underlying steps in this README.
+This end-to-end workload sample code is implemented for CPU using the Python language. With the installation of the Intel® AI Analytics Toolkit, the conda environment is prepared with Python 3.7, Intel® Distribution of Modin, Ray, Intel® Extension for Scikit-learn, and NumPy, after which the sample code can be run directly using the steps in this README.

## License

@@ -29,70 +29,52 @@ Code samples are licensed under the MIT license. See

Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).

-## Running Samples on the Intel® DevCloud
-If you are running this sample on the Intel® DevCloud, skip the Pre-requirements and go to the [Activate Conda Environment](#activate-conda) section.
+## Building Intel® Distribution of Modin and Intel® Extension for Scikit-learn for CPU to build and run the end-to-end workload
+Intel® Distribution of Modin and Intel® Extension for Scikit-learn are ready for use once you finish the Intel® AI Analytics Toolkit installation with the Conda Package Manager.

-## Building Intel® Distribution of Modin and Intel® oneAPI Data Analytics Library (oneDAL) for CPU to build and run end-to-end workload
+You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi) and the Intel® oneAPI Toolkit [Installation Guide](https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/conda/install-intel-ai-analytics-toolkit-via-conda.html) for conda environment setup and installation steps.

-### Pre-requirements (Local or Remote Host Installation)
-Intel® Distribution of Modin and Intel® oneAPI Data Analytics Library (oneDAL) is ready for use once you finish the Intel AI Analytics Toolkit installation with the Conda Package Manager.
-
-You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi), and the Toolkit [Getting Started Guide for Linux](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) for installation steps and scripts.
-
-### Activate conda environment With Root Access<a name="activate-conda"></a>
-
-In the Linux shell, navigate to your oneapi installation path, typically `/opt/intel/oneapi/` when installed as root or sudo, and `~/intel/oneapi/` when not installed as a super user.
-
-Activate the conda environment with the following command:
-
@@ -48,7 +52,7 @@ source activate intel-aikit-modin
-
-### Activate conda environment Without Root Access (Optional)
-
-By default, the Intel oneAPI AI Analytics toolkit is installed in the `oneapi` folder, which requires root privileges to manage it. If you would like to bypass using root access to manage your conda environment, then you can clone your desired conda environment using the following command:
+### Activate conda environment

+To install the Intel® Distribution of Modin Python environment, use the following command:
#### Linux
```
@@ -62,9 +66,9 @@ conda activate intel-aikit-modin
+conda create -n aikit-modin --override-channels intel-aikit-modin omniscidbe4py python=3.7 -c intel -c conda-forge
+```
+Then activate your conda environment with the following command:
+```
+conda activate aikit-modin
```

+Additionally, install the following packages in the conda environment:

-### Install Jupyter Notebook*
-
-Launch Jupyter Notebook in the directory housing the code example.
-
+### Install Jupyter Notebook
+Required to launch Jupyter Notebook in the directory housing the code example:
```
conda install jupyter nb_conda_kernels
@@ -76,7 +80,7 @@ pip install jupyter
-
-### Install wget package
-
-Install wget package to retrieve the Census dataset using HTTPS.
-
```
-pip install wget
@@ -85,7 +89,7 @@ pip install wget
-#### View in Jupyter Notebook

+### Install ray-dashboard and opencensus
+```
+conda install ray-dashboard
+pip install opencensus
+```

-Launch Jupyter Notebook in the directory housing the code example.
-
+#### View in Jupyter Notebook
+Launch Jupyter Notebook in the directory housing the code example:
```
jupyter notebook
@@ -112,3 +116,16 @@ Run the Program
+```
+## Running the end-to-end code sample
+### Run as Jupyter Notebook
+Open the .ipynb file and run the cells in Jupyter Notebook using the "Run" button. Alternatively, run the entire notebook using the "Restart kernel and re-run whole notebook" button (see the image below, using the "census modin" sample).
+![Click the Run Button in the Jupyter Notebook](Running_Jupyter_notebook.jpg "Run Button on Jupyter Notebook")
+
+### Run as Python File
+Open the notebook in Jupyter and download it as a Python file (see the image below, using the "census modin" sample).
+![Download as python file in the Jupyter Notebook](Running_Jupyter_notebook_as_Python.jpg "Download as python file in the Jupyter Notebook")
+Run the Program
+`python census_modin.py`
##### Expected Printed Output:
Expected Cell Output shown for census_modin.ipynb:
![Output](Expected_output.jpg "Expected output for Jupyter Notebook")
-
-
-### Request a Compute Node
-In order to run on the DevCloud, you need to request a compute node using node properties such as: `gpu`, `xeon`, `fpga_compile`, `fpga_runtime` and others. For more information about the node properties, execute the `pbsnodes` command.
-This node information must be provided when submitting a job to run your sample in batch mode using the qsub command. When you see the qsub command in the Run section of the [Hello World instructions](https://devcloud.intel.com/oneapi/get_started/aiAnalyticsToolkitSamples/), change the command to fit the node you are using. Nodes which are in bold indicate they are compatible with this sample:
-
-<!---Mark each compatible Node in BOLD-->
-| Node | Command |
-| ----------------- | ------------------------------------------------------- |
-| GPU | qsub -l nodes=1:gpu:ppn=2 -d . hello-world.sh |
-| CPU | qsub -l nodes=1:xeon:ppn=2 -d . hello-world.sh |
-| FPGA Compile Time | qsub -l nodes=1:fpga\_compile:ppn=2 -d . hello-world.sh |
-| FPGA Runtime | qsub -l nodes=1:fpga\_runtime:ppn=2 -d . hello-world.sh |

AI-and-Analytics/End-to-end-Workloads/Census/census_modin.ipynb

Lines changed: 26 additions & 112 deletions
@@ -17,7 +17,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"# Census with Modin and Intel® Data Analytics and Acceleration Library (DAAL) Accelerated Scikit-Learn"
+"# End-to-end Census workload with Intel® Distribution of Modin and Intel® Extension for Scikit-learn"
]
},
{
@@ -28,9 +28,8 @@
}
},
"source": [
-"In this example we will be building an end to end machine learning workload with US census from 1970 to 2010.\n",
-"It uses Modin with Ray as backend compute engine for ETL, and uses Ridge Regression from DAAL accelerated scikit-learn library\n",
-"to train and predict the US total income with education information."
+"In this example we will be running an end-to-end machine learning workload with US census data from 1970 to 2010.\n",
+"It uses Intel® Distribution of Modin with Ray as the backend compute engine for ETL, and uses the Ridge Regression algorithm from the Intel® Extension for Scikit-learn library to train and predict the correlation between US total income and education levels."
]
},
{
@@ -55,30 +54,19 @@
},
{
"cell_type": "markdown",
-"metadata": {
-"pycharm": {
-"name": "#%% md\n"
-}
-},
+"metadata": {},
"source": [
-"Import"
+"Import basic Python modules"
]
},
{
"cell_type": "code",
"execution_count": null,
-"metadata": {
-"pycharm": {
-"name": "#%%\n"
-}
-},
+"metadata": {},
"outputs": [],
"source": [
"import os\n",
-"import numpy as np\n",
-"\n",
-"from sklearn import config_context\n",
-"from sklearn.metrics import mean_squared_error, r2_score"
+"import numpy as np"
]
},
{
@@ -89,7 +77,7 @@
}
},
"source": [
-"Import Modin and set Ray as the compute engine"
+"Import Modin and set Ray as the compute engine. This engine uses the analytical database OmniSciDB to obtain high single-node scalability for a specific set of dataframe operations."
]
},
{
@@ -102,8 +90,11 @@
},
"outputs": [],
"source": [
-"import modin.pandas as pd\n",
-"os.environ[\"MODIN_ENGINE\"] = \"ray\""
+"#import modin.pandas as pd\n",
+"os.environ[\"MODIN_ENGINE\"] = \"ray\"\n",
+"os.environ[\"MODIN_BACKEND\"] = \"omnisci\"\n",
+"os.environ[\"MODIN_EXPERIMENTAL\"] = \"True\"\n",
+"import modin.pandas as pd"
]
},
{
@@ -114,7 +105,7 @@
}
},
"source": [
-"Load DAAL accelerated sklearn patch and import packages from the patch"
+"Import Intel® Extension for Scikit-learn, which dynamically patches scikit-learn estimators to use the Intel® oneAPI Data Analytics Library as the underlying solver, while getting the same solution faster."
]
},
{
@@ -127,9 +118,11 @@
},
"outputs": [],
"source": [
-"import daal4py.sklearn\n",
-"daal4py.sklearn.patch_sklearn()\n",
+"from sklearnex import patch_sklearn\n",
+"patch_sklearn()\n",
"\n",
+"from sklearn import config_context\n",
+"from sklearn.metrics import mean_squared_error, r2_score\n",
"from sklearn.model_selection import train_test_split\n",
"import sklearn.linear_model as lm"
]
@@ -142,7 +135,7 @@
}
},
"source": [
-"Read the data from the downloaded archive file"
+"Read and load the data into a dataframe from the downloaded archive file"
]
},
{
@@ -166,7 +159,7 @@
}
},
"source": [
-"ETL"
+"Run ETL operations to prepare and transform the ingested dataset into a form that can be readily consumed by the ridge regression algorithm: keep the relevant columns, clean up samples with invalid income or education values, and normalize the income to account for yearly inflation."
]
},
{
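The ETL markdown added in the hunk above compresses several steps into one sentence: keep the relevant columns, drop samples with invalid income or education values, and normalize income for yearly inflation. Below is a hedged sketch of what such a transformation might look like with Modin; the file name and the column names (`YEAR`, `EDUC`, `INCTOT`, `CPI99`) are assumptions modeled on a typical IPUMS census extract and are not taken from this diff.

```
import os
os.environ["MODIN_ENGINE"] = "ray"
import modin.pandas as pd

# Hypothetical archive and column names; the actual notebook defines its own.
df = pd.read_csv("census_1970_2010.csv.gz",
                 usecols=["YEAR", "EDUC", "INCTOT", "CPI99"])

# Drop samples with invalid income or education codes (9999999 is a common
# IPUMS "not available" code for INCTOT; treat it as an assumption here).
df = df[(df["INCTOT"] != 9999999) & (df["EDUC"] >= 0)]

# Normalize income to account for yearly inflation using the CPI99 factor.
df["INCTOT"] = df["INCTOT"] * df["CPI99"]

# Feature matrix and target (education) for the ridge regression step.
X = df.drop(columns=["EDUC"]).to_numpy()
y = df["EDUC"].to_numpy()
```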
@@ -209,7 +202,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Train the model and predict the income"
+"Train the model and run prediction. Loop 50 times over different train/test splits to remove any bias in splitting the dataset and to reduce the chance of over-fitting from selecting a train set that fits the model too well to the test set."
]
},
{
@@ -254,7 +247,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Check the regression results: mean squared error and r square score"
+"Check the regression results by calculating the accuracy of the prediction using mean squared error and r square score"
]
},
{
@@ -275,92 +268,13 @@
"print(\"mean COD ± deviation: {:.9f} ± {:.9f}\".format(mean_cod, cod_dev))"
]
},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"mean_mse = sum(mse_values)/len(mse_values)\n",
-"mean_cod = sum(cod_values)/len(cod_values)\n",
-"mse_dev = pow(sum([(mse_value - mean_mse)**2 for mse_value in mse_values])/(len(mse_values) - 1), 0.5)\n",
-"cod_dev = pow(sum([(cod_value - mean_cod)**2 for cod_value in cod_values])/(len(cod_values) - 1), 0.5)\n",
-"print(\"mean MSE ± deviation: {:.9f} ± {:.9f}\".format(mean_mse, mse_dev))\n",
-"print(\"mean COD ± deviation: {:.9f} ± {:.9f}\".format(mean_cod, cod_dev))\n"
-]
-},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"Train the model and predict the income"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"pycharm": {
-"name": "#%%\n"
-}
-},
-"outputs": [],
-"source": [
-"# ML - training and inference\n",
-"clf = lm.Ridge()\n",
-"\n",
-"mse_values, cod_values = [], []\n",
-"N_RUNS = 50\n",
-"TRAIN_SIZE = 0.9\n",
-"random_state = 777\n",
-"\n",
-"X = np.ascontiguousarray(X, dtype=np.float64)\n",
-"y = np.ascontiguousarray(y, dtype=np.float64)\n",
-"\n",
-"# cross validation\n",
-"for i in range(N_RUNS):\n",
-" X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=TRAIN_SIZE,\n",
-" random_state=random_state)\n",
-" random_state += 777\n",
-"\n",
-" # training\n",
-" with config_context(assume_finite=True):\n",
-" model = clf.fit(X_train, y_train)\n",
-"\n",
-" # inference\n",
-" y_pred = model.predict(X_test)\n",
-"\n",
-" mse_values.append(mean_squared_error(y_test, y_pred))\n",
-" cod_values.append(r2_score(y_test, y_pred))"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {
-"pycharm": {
-"name": "#%% md\n"
-}
-},
-"source": [
-"Check the regression results: mean squared error and r square score"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"pycharm": {
-"name": "#%%\n"
-}
-},
-"outputs": [],
-"source": [
-"mean_mse = sum(mse_values)/len(mse_values)\n",
-"mean_cod = sum(cod_values)/len(cod_values)\n",
-"mse_dev = pow(sum([(mse_value - mean_mse)**2 for mse_value in mse_values])/(len(mse_values) - 1), 0.5)\n",
-"cod_dev = pow(sum([(cod_value - mean_cod)**2 for cod_value in cod_values])/(len(cod_values) - 1), 0.5)\n",
-"print(\"mean MSE ± deviation: {:.9f} ± {:.9f}\".format(mean_mse, mse_dev))\n",
-"print(\"mean COD ± deviation: {:.9f} ± {:.9f}\".format(mean_cod, cod_dev))\n"
+"Verify the accuracy:\n",
+"mean MSE ± deviation: 0.032564569 ± 0.000041799\n",
+"mean COD ± deviation: 0.995367533 ± 0.000005869"
]
}
],
@@ -380,7 +294,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.7.7"
+"version": "3.7.11"
}
},
"nbformat": 4,

AI-and-Analytics/End-to-end-Workloads/Census/sample.json

Lines changed: 5 additions & 2 deletions
@@ -2,7 +2,7 @@
"guid": "AA055D7B-290C-4FCA-990B-B9FC88AF18D4",
"name": "Census",
"categories": ["Toolkit/oneAPI AI And Analytics/End-to-End Workloads"],
-"description": "This sample illustrates using Modin and daal optimized scikit-learn to build and run an end-to-end machine learning workload",
+"description": "This sample illustrates the use of Intel Distribution of Modin and Intel Extension for Scikit-learn to build and run an end-to-end machine learning workload",
"builder": ["cli"],
"languages": [{"python":{}}],
"dependencies": ["intelpython"],
@@ -11,8 +11,11 @@
"ciTests": {
"linux": [
{
-"env": ["source /opt/intel/oneapi/setvars.sh --force", "conda create -y -n aikit-modin-test -c intel -c conda-forge runipy intel-aikit-modin", "conda activate aikit-modin-test"],
+"env": ["source activate base"],
"steps": [
+"conda create -y -n aikit-modin --override-channels intel-aikit-modin omniscidbe4py python=3.7 runipy ray-dashboard -c intel -c conda-forge",
+"conda activate aikit-modin",
+"pip install opencensus",
"runipy census_modin.ipynb"
]
}
