Commit 4587181
E2E Census code updated, modin, scikit-learn extension and readme fixed (oneapi-src#641)
* updates to README
* Update README.md
* update to Jupyter notebook
* Census readme updated
* Census readme updated
* sample.json updated for Census
* updates to sample.json
* typo fixed
* updates to sample.json
1 parent c73e1df commit 4587181

3 files changed (+68, -169 lines)
AI-and-Analytics/End-to-end-Workloads/Census/README.md

Lines changed: 37 additions & 55 deletions
@@ -1,26 +1,26 @@
# End-to-end Machine Learning Workload: `Census` Sample

-This sample code illustrates how to use Intel® Distribution of Modin for ETL operations and ridge regression algorithm from the Intel® oneAPI Data Analytics Library (oneDAL) accelerated scikit-learn library to build and run an end to end machine learning workload. Both Intel® Distribution of Modin and oneDAL accelerated scikit-learn libraries are available together in [Intel® oneAPI AI Analytics Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). This sample code demonstrates how to seamlessly run the end-to-end census workload using the toolkit, without any external dependencies.
+This sample code illustrates how to use Intel® Distribution of Modin for ETL operations and the ridge regression algorithm from the Intel® Extension for Scikit-learn library to build and run an end-to-end machine learning workload. Both Intel® Distribution of Modin and Intel® Extension for Scikit-learn are available together in the [Intel® oneAPI AI Analytics Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). This sample code demonstrates how to seamlessly run the end-to-end census workload using the toolkit, without any external dependencies.

| Optimized for | Description
| :--- | :---
| OS | 64-bit Linux: Ubuntu 18.04 or higher
| Hardware | Intel Atom® Processors; Intel® Core™ Processor Family; Intel® Xeon® Processor Family; Intel® Xeon® Scalable Performance Processor Family
-| Software | Intel® AI Analytics Toolkit (Python version 3.7, Intel® Distribution of Modin , Ray, Intel® oneAPI Data Analytics Library (oneDAL), Scikit-Learn, NumPy)
-| What you will learn | How to use Intel® Distribution of Modin and oneDAL optimized scikit-learn (developed and owned by Intel) to build end to end ML workloads and gain performance.
+| Software | Intel® AI Analytics Toolkit (Python version 3.7, Intel® Distribution of Modin, Ray, Intel® Extension for Scikit-learn, NumPy)
+| What you will learn | How to use Intel® Distribution of Modin and Intel® Extension for Scikit-learn to build end-to-end ML workloads and gain performance.
| Time to complete | 15-18 minutes

## Purpose
-Intel® Distribution of Modin uses Ray to provide an effortless way to speed up your Pandas notebooks, scripts and libraries. Unlike other distributed DataFrame libraries, Intel® Distribution of Modin provides seamless integration and compatibility with existing Pandas code. Daal4py is a simplified API to Intel oneDAL that allows for fast usage of the framework suited for Data Scientists and Machine Learning users. It is built to help provide an abstraction to Intel® oneDAL for either direct usage or integration into one's own framework.
+Intel® Distribution of Modin uses Ray to provide an effortless way to speed up your Pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Intel® Distribution of Modin provides seamless integration and compatibility with existing Pandas code. Intel® Extension for Scikit-learn dynamically patches scikit-learn estimators to use the Intel® oneAPI Data Analytics Library (oneDAL) as the underlying solver, while getting the same solution faster.

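The Purpose paragraph above combines two ideas: Modin as a drop-in pandas replacement running on Ray, and Intel® Extension for Scikit-learn patching stock scikit-learn estimators. The following minimal sketch shows how the two typically fit together; the environment variable, imports, and `patch_sklearn()` call mirror the notebook cells changed later in this commit, while the tiny DataFrame and the `Ridge` fit are illustrative only (the real sample uses the IPUMS census dataset).

```
import os

# Select Ray as Modin's compute engine before importing modin.pandas
# (the notebook in this commit additionally selects the OmniSci backend).
os.environ["MODIN_ENGINE"] = "ray"
import modin.pandas as pd

# Patch scikit-learn so that supported estimators run on oneDAL.
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.linear_model import Ridge  # now resolves to the accelerated implementation

# Illustrative toy data only; the real workload ingests the census CSV.
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0], "target": [1.1, 1.9, 3.2, 3.9]})
model = Ridge().fit(df[["feature"]].to_numpy(), df["target"].to_numpy())
print(model.predict([[5.0]]))
```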
#### Model and dataset
In this sample, you will use Intel® Distribution of Modin to ingest and process U.S. census data from 1970 to 2010 in order to build a ridge regression based model to find the relation between education and the total income earned in the US.
-Data transformation stage normalizes the income to the yearly inflation, balances the data such that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the income of each sample.
+The data transformation stage normalizes the income to yearly inflation, balances the data such that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the education of each sample.

Dataset is from IPUMS USA, University of Minnesota, [www.ipums.org](https://ipums.org/) (Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0)

## Key Implementation Details
-This end-to-end workload sample code is implemented for CPU using the Python language. With the installation of Intel AI Analytics Toolkit, the conda environment is prepared with Python version 3.7, Intel® Distribution of Modin , Ray, Intel® oneAPI Data Analytics Library (oneDAL), Scikit-Learn, NumPy following which the sample code can be directly run using the underlying steps in this README.
+This end-to-end workload sample code is implemented for CPU using the Python language. With the installation of the Intel® AI Analytics Toolkit, the conda environment is prepared with Python 3.7, Intel® Distribution of Modin, Ray, Intel® Extension for Scikit-learn, and NumPy, after which the sample code can be run directly using the steps in this README.

## License

@@ -29,70 +29,52 @@ Code samples are licensed under the MIT license. See

Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).

-## Running Samples on the Intel® DevCloud
-If you are running this sample on the Intel® DevCloud, skip the Pre-requirements and go to the [Activate Conda Environment](#activate-conda) section.
+## Building Intel® Distribution of Modin and Intel® Extension for Scikit-learn for CPU to build and run the end-to-end workload
+Intel® Distribution of Modin and Intel® Extension for Scikit-learn are ready for use once you finish the Intel® AI Analytics Toolkit installation with the Conda Package Manager.

-## Building Intel® Distribution of Modin and Intel® oneAPI Data Analytics Library (oneDAL) for CPU to build and run end-to-end workload
+You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi) and the Intel® oneAPI Toolkit [Installation Guide](https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/conda/install-intel-ai-analytics-toolkit-via-conda.html) for conda environment setup and installation steps.

-### Pre-requirements (Local or Remote Host Installation)
-Intel® Distribution of Modin and Intel® oneAPI Data Analytics Library (oneDAL) is ready for use once you finish the Intel AI Analytics Toolkit installation with the Conda Package Manager.
-
-You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi), and the Toolkit [Getting Started Guide for Linux](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) for installation steps and scripts.
-
-### Activate conda environment With Root Access<a name="activate-conda"></a>
-
-In the Linux shell, navigate to your oneapi installation path, typically `/opt/intel/oneapi/` when installed as root or sudo, and `~/intel/oneapi/` when not installed as a super user.
-
-Activate the conda environment with the following command:
-
@@ -48,7 +52,7 @@ source activate intel-aikit-modin
-
-### Activate conda environment Without Root Access (Optional)
-
-By default, the Intel oneAPI AI Analytics toolkit is installed in the `oneapi` folder, which requires root privileges to manage it. If you would like to bypass using root access to manage your conda environment, then you can clone your desired conda environment using the following command:
+### Activate conda environment

+To install the Intel® Distribution of Modin Python environment, use the following command:
#### Linux
```
@@ -62,9 +66,9 @@ conda activate intel-aikit-modin
+conda create -n aikit-modin --override-channels intel-aikit-modin omniscidbe4py python=3.7 -c intel -c conda-forge
+```
+Then activate your conda environment with the following command:
+```
+conda activate aikit-modin
```

+Additionally, install the following packages in the conda environment:

-### Install Jupyter Notebook*
-
-Launch Jupyter Notebook in the directory housing the code example.
-
+### Install Jupyter Notebook
+Required to launch Jupyter Notebook in the directory housing the code example:
```
conda install jupyter nb_conda_kernels
@@ -76,7 +80,7 @@ pip install jupyter
-
-### Install wget package
-
-Install wget package to retrieve the Census dataset using HTTPS.
-
```
-pip install wget
@@ -85,7 +89,7 @@ pip install wget
-#### View in Jupyter Notebook

+### Install ray-dashboard and opencensus
+```
+conda install ray-dashboard
+pip install opencensus
+```

-Launch Jupyter Notebook in the directory housing the code example.
-
+#### View in Jupyter Notebook
+Launch Jupyter Notebook in the directory housing the code example:
```
jupyter notebook
@@ -112,3 +116,16 @@ Run the Program
+```
+## Running the end-to-end code sample
+### Run as Jupyter Notebook
+Open the .ipynb file and run the cells in Jupyter Notebook using the "Run" button. Alternatively, run the entire notebook using the "Restart kernel and re-run whole notebook" button (see the image below, using the "census modin" sample).
+![Click the Run Button in the Jupyter Notebook](Running_Jupyter_notebook.jpg "Run Button on Jupyter Notebook")
+
+### Run as Python File
+Open the notebook in Jupyter and download it as a Python file (see the image below, using the "census modin" sample).
+![Download as python file in the Jupyter Notebook](Running_Jupyter_notebook_as_Python.jpg "Download as python file in the Jupyter Notebook")
+Run the Program
+`python census_modin.py`
##### Expected Printed Output:
Expected Cell Output shown for census_modin.ipynb:
![Output](Expected_output.jpg "Expected output for Jupyter Notebook")
-
-
-### Request a Compute Node
-In order to run on the DevCloud, you need to request a compute node using node properties such as: `gpu`, `xeon`, `fpga_compile`, `fpga_runtime` and others. For more information about the node properties, execute the `pbsnodes` command.
-This node information must be provided when submitting a job to run your sample in batch mode using the qsub command. When you see the qsub command in the Run section of the [Hello World instructions](https://devcloud.intel.com/oneapi/get_started/aiAnalyticsToolkitSamples/), change the command to fit the node you are using. Nodes which are in bold indicate they are compatible with this sample:
-
-<!---Mark each compatible Node in BOLD-->
-| Node | Command |
-| ----------------- | ------------------------------------------------------- |
-| GPU | qsub -l nodes=1:gpu:ppn=2 -d . hello-world.sh |
-| CPU | qsub -l nodes=1:xeon:ppn=2 -d . hello-world.sh |
-| FPGA Compile Time | qsub -l nodes=1:fpga\_compile:ppn=2 -d . hello-world.sh |
-| FPGA Runtime | qsub -l nodes=1:fpga\_runtime:ppn=2 -d . hello-world.sh |

AI-and-Analytics/End-to-end-Workloads/Census/census_modin.ipynb

Lines changed: 26 additions & 112 deletions
@@ -17,7 +17,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"# Census with Modin and Intel® Data Analytics and Acceleration Library (DAAL) Accelerated Scikit-Learn"
+"# End-to-end Census workload with Intel® Distribution of Modin and Intel® Extension for Scikit-learn"
]
},
{
@@ -28,9 +28,8 @@
}
},
"source": [
-"In this example we will be building an end to end machine learning workload with US census from 1970 to 2010.\n",
-"It uses Modin with Ray as backend compute engine for ETL, and uses Ridge Regression from DAAL accelerated scikit-learn library\n",
-"to train and predict the US total income with education information."
+"In this example we will be running an end-to-end machine learning workload with US census data from 1970 to 2010.\n",
+"It uses Intel® Distribution of Modin with Ray as the backend compute engine for ETL, and uses the Ridge Regression algorithm from the Intel® Extension for Scikit-learn library to train and predict the correlation between US total income and education levels."
]
},
{
@@ -55,30 +54,19 @@
},
{
"cell_type": "markdown",
-"metadata": {
-"pycharm": {
-"name": "#%% md\n"
-}
-},
+"metadata": {},
"source": [
-"Import"
+"Import basic Python modules"
]
},
{
"cell_type": "code",
"execution_count": null,
-"metadata": {
-"pycharm": {
-"name": "#%%\n"
-}
-},
+"metadata": {},
"outputs": [],
"source": [
"import os\n",
-"import numpy as np\n",
-"\n",
-"from sklearn import config_context\n",
-"from sklearn.metrics import mean_squared_error, r2_score"
+"import numpy as np"
]
},
{
@@ -89,7 +77,7 @@
}
},
"source": [
-"Import Modin and set Ray as the compute engine"
+"Import Modin and set Ray as the compute engine. This engine uses the analytical database OmniSciDB to obtain high single-node scalability for a specific set of dataframe operations."
]
},
{
@@ -102,8 +90,11 @@
},
"outputs": [],
"source": [
-"import modin.pandas as pd\n",
-"os.environ[\"MODIN_ENGINE\"] = \"ray\""
+"#import modin.pandas as pd\n",
+"os.environ[\"MODIN_ENGINE\"] = \"ray\"\n",
+"os.environ[\"MODIN_BACKEND\"] = \"omnisci\"\n",
+"os.environ[\"MODIN_EXPERIMENTAL\"] = \"True\"\n",
+"import modin.pandas as pd"
]
},
{
@@ -114,7 +105,7 @@
}
},
"source": [
-"Load DAAL accelerated sklearn patch and import packages from the patch"
+"Import Intel® Extension for Scikit-learn, which dynamically patches scikit-learn estimators to use the Intel® oneAPI Data Analytics Library as the underlying solver, while getting the same solution faster."
]
},
{
@@ -127,9 +118,11 @@
},
"outputs": [],
"source": [
-"import daal4py.sklearn\n",
-"daal4py.sklearn.patch_sklearn()\n",
+"from sklearnex import patch_sklearn\n",
+"patch_sklearn()\n",
"\n",
+"from sklearn import config_context\n",
+"from sklearn.metrics import mean_squared_error, r2_score\n",
"from sklearn.model_selection import train_test_split\n",
"import sklearn.linear_model as lm"
]
@@ -142,7 +135,7 @@
}
},
"source": [
-"Read the data from the downloaded archive file"
+"Read and load the data into a dataframe from the downloaded archive file"
]
},
{
@@ -166,7 +159,7 @@
}
},
"source": [
-"ETL"
+"Run ETL operations to prepare and transform the ingested dataset into a form that can be readily consumed by the ridge regression algorithm: keep the relevant columns, clean up samples with invalid income or education values, and normalize the income to account for yearly inflation."
]
},
{
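The ETL markdown added in the hunk above compresses several steps into one sentence: keep the relevant columns, drop samples with invalid income or education values, and normalize income for yearly inflation. Below is a hedged sketch of what such a transformation might look like with Modin; the file name and the column names (`YEAR`, `EDUC`, `INCTOT`, `CPI99`) are assumptions modeled on a typical IPUMS census extract and are not taken from this diff.

```
import os
os.environ["MODIN_ENGINE"] = "ray"
import modin.pandas as pd

# Hypothetical archive and column names; the actual notebook defines its own.
df = pd.read_csv("census_1970_2010.csv.gz",
                 usecols=["YEAR", "EDUC", "INCTOT", "CPI99"])

# Drop samples with invalid income or education codes (9999999 is a common
# IPUMS "not available" code for INCTOT; treat it as an assumption here).
df = df[(df["INCTOT"] != 9999999) & (df["EDUC"] >= 0)]

# Normalize income to account for yearly inflation using the CPI99 factor.
df["INCTOT"] = df["INCTOT"] * df["CPI99"]

# Feature matrix and target (education) for the ridge regression step.
X = df.drop(columns=["EDUC"]).to_numpy()
y = df["EDUC"].to_numpy()
```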
@@ -209,7 +202,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Train the model and predict the income"
+"Train the model and run prediction. Loop 50 times over different train/test splits to remove any bias in splitting the dataset and to reduce the chance of over-fitting from selecting a train set that fits the model too well to the test set."
]
},
{
@@ -254,7 +247,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Check the regression results: mean squared error and r square score"
+"Check the regression results by calculating the accuracy of the prediction using mean squared error and r square score"
]
},
{
@@ -275,92 +268,13 @@
"print(\"mean COD ± deviation: {:.9f} ± {:.9f}\".format(mean_cod, cod_dev))"
]
},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"mean_mse = sum(mse_values)/len(mse_values)\n",
-"mean_cod = sum(cod_values)/len(cod_values)\n",
-"mse_dev = pow(sum([(mse_value - mean_mse)**2 for mse_value in mse_values])/(len(mse_values) - 1), 0.5)\n",
-"cod_dev = pow(sum([(cod_value - mean_cod)**2 for cod_value in cod_values])/(len(cod_values) - 1), 0.5)\n",
-"print(\"mean MSE ± deviation: {:.9f} ± {:.9f}\".format(mean_mse, mse_dev))\n",
-"print(\"mean COD ± deviation: {:.9f} ± {:.9f}\".format(mean_cod, cod_dev))\n"
-]
-},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"Train the model and predict the income"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"pycharm": {
-"name": "#%%\n"
-}
-},
-"outputs": [],
-"source": [
-"# ML - training and inference\n",
-"clf = lm.Ridge()\n",
-"\n",
-"mse_values, cod_values = [], []\n",
-"N_RUNS = 50\n",
-"TRAIN_SIZE = 0.9\n",
-"random_state = 777\n",
-"\n",
-"X = np.ascontiguousarray(X, dtype=np.float64)\n",
-"y = np.ascontiguousarray(y, dtype=np.float64)\n",
-"\n",
-"# cross validation\n",
-"for i in range(N_RUNS):\n",
-" X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=TRAIN_SIZE,\n",
-" random_state=random_state)\n",
-" random_state += 777\n",
-"\n",
-" # training\n",
-" with config_context(assume_finite=True):\n",
-" model = clf.fit(X_train, y_train)\n",
-"\n",
-" # inference\n",
-" y_pred = model.predict(X_test)\n",
-"\n",
-" mse_values.append(mean_squared_error(y_test, y_pred))\n",
-" cod_values.append(r2_score(y_test, y_pred))"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {
-"pycharm": {
-"name": "#%% md\n"
-}
-},
-"source": [
-"Check the regression results: mean squared error and r square score"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"pycharm": {
-"name": "#%%\n"
-}
-},
-"outputs": [],
-"source": [
-"mean_mse = sum(mse_values)/len(mse_values)\n",
-"mean_cod = sum(cod_values)/len(cod_values)\n",
-"mse_dev = pow(sum([(mse_value - mean_mse)**2 for mse_value in mse_values])/(len(mse_values) - 1), 0.5)\n",
-"cod_dev = pow(sum([(cod_value - mean_cod)**2 for cod_value in cod_values])/(len(cod_values) - 1), 0.5)\n",
-"print(\"mean MSE ± deviation: {:.9f} ± {:.9f}\".format(mean_mse, mse_dev))\n",
-"print(\"mean COD ± deviation: {:.9f} ± {:.9f}\".format(mean_cod, cod_dev))\n"
+"Verify the accuracy:\n",
+"mean MSE ± deviation: 0.032564569 ± 0.000041799\n",
+"mean COD ± deviation: 0.995367533 ± 0.000005869"
]
}
],
@@ -380,7 +294,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.7.7"
+"version": "3.7.11"
}
},
"nbformat": 4,

AI-and-Analytics/End-to-end-Workloads/Census/sample.json

Lines changed: 5 additions & 2 deletions
@@ -2,7 +2,7 @@
"guid": "AA055D7B-290C-4FCA-990B-B9FC88AF18D4",
"name": "Census",
"categories": ["Toolkit/oneAPI AI And Analytics/End-to-End Workloads"],
-"description": "This sample illustrates using Modin and daal optimized scikit-learn to build and run an end-to-end machine learning workload",
+"description": "This sample illustrates the use of Intel Distribution of Modin and Intel Extension for Scikit-learn to build and run an end-to-end machine learning workload",
"builder": ["cli"],
"languages": [{"python":{}}],
"dependencies": ["intelpython"],
@@ -11,8 +11,11 @@
"ciTests": {
"linux": [
{
-"env": ["source /opt/intel/oneapi/setvars.sh --force", "conda create -y -n aikit-modin-test -c intel -c conda-forge runipy intel-aikit-modin", "conda activate aikit-modin-test"],
+"env": ["source activate base"],
"steps": [
+"conda create -y -n aikit-modin --override-channels intel-aikit-modin omniscidbe4py python=3.7 runipy ray-dashboard -c intel -c conda-forge",
+"conda activate aikit-modin",
+"pip install opencensus",
"runipy census_modin.ipynb"
]
}
