E2E Census code updated, modin, scikit-learn extension and readme fixed (oneapi-src#641)
* updates to README
* Update README.md
* update to Jupyter notebook
* Census readme updated
* Census readme updated
* sample.json updated for Census
* updates to sample.json
* typo fixed
* updates to sample.json
- This sample code illustrates how to use Intel® Distribution of Modin for ETL operations and ridge regression algorithm from the Intel® oneAPI Data Analytics Library (oneDAL) accelerated scikit-learn library to build and run an end to end machine learning workload. Both Intel® Distribution of Modin and oneDAL accelerated scikit-learn libraries are available together in [Intel® oneAPI AI Analytics Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). This sample code demonstrates how to seamlessly run the end-to-end census workload using the toolkit, without any external dependencies.
+ This sample code illustrates how to use Intel® Distribution of Modin for ETL operations and ridge regression algorithm from the Intel® extension of scikit-learn library to build and run an end to end machine learning workload. Both Intel® Distribution of Modin and Intel® Extension for Scikit-learn libraries are available together in [Intel® oneAPI AI Analytics Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). This sample code demonstrates how to seamlessly run the end-to-end census workload using the toolkit, without any external dependencies.
- | Software | Intel® AI Analytics Toolkit (Python version 3.7, Intel® Distribution of Modin , Ray, Intel® oneAPI Data Analytics Library (oneDAL), Scikit-Learn, NumPy)
- | What you will learn | How to use Intel® Distribution of Modin and oneDAL optimized scikit-learn (developed and owned by Intel) to build end to end ML workloads and gain performance.
+ | Software | Intel® AI Analytics Toolkit (Python version 3.7, Intel® Distribution of Modin , Ray, Intel® Extension for Scikit-Learn, NumPy)
+ | What you will learn | How to use Intel® Distribution of Modin and Intel® Extension for Scikit-learn to build end to end ML workloads and gain performance.
| Time to complete | 15-18 minutes

## Purpose
- Intel® Distribution of Modin uses Ray to provide an effortless way to speed up your Pandas notebooks, scripts and libraries. Unlike other distributed DataFrame libraries, Intel® Distribution of Modin provides seamless integration and compatibility with existing Pandas code. Daal4py is a simplified API to Intel oneDAL that allows for fast usage of the framework suited for Data Scientists and Machine Learning users. It is built to help provide an abstraction to Intel® oneDAL for either direct usage or integration into one's own framework.
+ Intel® Distribution of Modin uses Ray to provide an effortless way to speed up your Pandas notebooks, scripts and libraries. Unlike other distributed DataFrame libraries, Intel® Distribution of Modin provides seamless integration and compatibility with existing Pandas code. Intel(R) Extension for Scikit-learn dynamically patches scikit-learn estimators to use Intel(R) oneAPI Data Analytics Library as the underlying solver, while getting the same solution faster.
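
As a rough illustration of the two mechanisms described in the Purpose text, the sketch below applies the Intel® Extension for Scikit-learn patch when the package is installed, and selects Modin with the Ray engine as the dataframe library, falling back to stock pandas otherwise. The helper names `enable_sklearn_acceleration` and `import_dataframe_library` are hypothetical and are not taken from the sample. The sketch assumes the `sklearnex.patch_sklearn` and `modin.config.Engine.put` entry points.

```python
import importlib.util


def enable_sklearn_acceleration() -> bool:
    """Patch scikit-learn if Intel Extension for Scikit-learn is installed.

    Returns True when the patch was applied, False when the extension is absent.
    (Hypothetical helper, not part of the sample.)
    """
    if importlib.util.find_spec("sklearnex") is None:
        return False
    from sklearnex import patch_sklearn

    patch_sklearn()  # call before importing estimators from sklearn
    return True


def import_dataframe_library():
    """Return modin.pandas (Ray engine) when Modin is installed, else pandas.

    (Hypothetical helper, not part of the sample.)
    """
    if importlib.util.find_spec("modin") is not None:
        import modin.config as cfg

        cfg.Engine.put("ray")  # select Ray as the compute engine
        import modin.pandas as pd
    else:
        import pandas as pd  # same API, single-process fallback
    return pd
```

A script would typically call `enable_sklearn_acceleration()` first and only then import estimators such as `sklearn.linear_model.Ridge`, so that the patched implementations are the ones picked up.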
#### Model and dataset
In this sample, you will use Intel® Distribution of Modin to ingest and process U.S. census data from 1970 to 2010 in order to build a ridge regression based model to find the relation between education and the total income earned in the US.
- Data transformation stage normalizes the income to the yearly inflation, balances the data such that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the income of each sample.
+ Data transformation stage normalizes the income to the yearly inflation, balances the data such that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the education of each sample.
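
The three transformation steps named above (inflation normalization, per-year balancing, feature extraction) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the notebook's Modin code: the row keys `year`, `income` and `education`, the helper names, and the inflation multipliers are all made up for the example, and the target column follows the updated README wording (education).

```python
import random
from collections import defaultdict

# Illustrative inflation multipliers (NOT real CPI data): each factor converts
# income earned in that year into common reference-year dollars.
CPI_FACTOR = {1970: 4.0, 1990: 2.0, 2010: 1.0}


def normalize_income(rows):
    """Return new rows with income scaled by the inflation factor of its year."""
    return [
        {**row, "income": row["income"] * CPI_FACTOR[row["year"]]}
        for row in rows
    ]


def balance_by_year(rows, seed=0):
    """Downsample so every year keeps as many rows as the rarest year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row)
    target = min(len(group) for group in by_year.values())
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = []
    for year in sorted(by_year):
        balanced.extend(rng.sample(by_year[year], target))
    return balanced


def extract_features(rows):
    """Split rows into a feature matrix X (income) and a target vector y (education)."""
    X = [[row["income"]] for row in rows]
    y = [row["education"] for row in rows]
    return X, y
```

With Modin the same steps would normally be expressed as column-wise dataframe operations rather than Python loops; the loop form is used here only to keep the example dependency-free.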
Dataset is from IPUMS USA, University of Minnesota, [www.ipums.org](https://ipums.org/) (Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0)
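
The modelling step of this section (ridge regression on the extracted features, evaluated with mean squared error and R² over repeated random train/test splits) can be sketched without any third-party dependency. The code below is a stand-in, not the notebook's implementation: it handles a single feature in closed form, whereas the sample uses scikit-learn's `Ridge`. The functions `mean_squared_error` and `r2_score` are simply named after their scikit-learn counterparts. The 50 repetitions come from the notebook description; `test_fraction` and `alpha` are assumed values.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ridge_1d(x, y, alpha=1.0):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    return slope, y_mean - slope * x_mean


def evaluate(x, y, n_runs=50, test_fraction=0.1, alpha=1.0):
    """Average MSE and R^2 over n_runs random train/test splits."""
    n_test = max(2, int(len(x) * test_fraction))
    mse_total = r2_total = 0.0
    for seed in range(n_runs):
        idx = list(range(len(x)))
        random.Random(seed).shuffle(idx)  # a different split for every run
        test, train = idx[:n_test], idx[n_test:]
        slope, intercept = fit_ridge_1d(
            [x[i] for i in train], [y[i] for i in train], alpha
        )
        pred = [slope * x[i] + intercept for i in test]
        truth = [y[i] for i in test]
        mse_total += mean_squared_error(truth, pred)
        r2_total += r2_score(truth, pred)
    return mse_total / n_runs, r2_total / n_runs
```

Averaging over many splits makes the reported scores less dependent on one particular partition of the data, which appears to be the motivation given in the notebook for looping 50 times.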
## Key Implementation Details
- This end-to-end workload sample code is implemented for CPU using the Python language. With the installation of Intel AI Analytics Toolkit, the conda environment is prepared with Python version 3.7, Intel® Distribution of Modin , Ray, Intel® oneAPI Data Analytics Library (oneDAL), Scikit-Learn, NumPy following which the sample code can be directly run using the underlying steps in this README.
+ This end-to-end workload sample code is implemented for CPU using the Python language. With the installation of Intel AI Analytics Toolkit, the conda environment is prepared with Python version 3.7, Intel® Distribution of Modin , Ray, Intel® Extension for Scikit-Learn, NumPy following which the sample code can be directly run using the underlying steps in this README.
## License
@@ -29,70 +29,52 @@ Code samples are licensed under the MIT license. See
Third party program Licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).

- ## Running Samples on the Intel® DevCloud
- If you are running this sample on the Intel® DevCloud, skip the Pre-requirements and go to the [Activate Conda Environment](#activate-conda) section.
+ ## Building Intel® Distribution of Modin and Intel® Extension for Scikit-learn for CPU to build and run end-to-end workload
+ Intel® Distribution of Modin and Intel® Extension for Scikit-learn is ready for use once you finish the Intel AI Analytics Toolkit installation with the Conda Package Manager.

- ## Building Intel® Distribution of Modin and Intel® oneAPI Data Analytics Library (oneDAL) for CPU to build and run end-to-end workload
+ You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi), and the Intel® oneAPI Toolkit [Installation Guide](https://software.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/conda/install-intel-ai-analytics-toolkit-via-conda.html) for conda environment setup and installation steps.

- ### Pre-requirements (Local or Remote Host Installation)
- Intel® Distribution of Modin and Intel® oneAPI Data Analytics Library (oneDAL) is ready for use once you finish the Intel AI Analytics Toolkit installation with the Conda Package Manager.
-
- You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi), and the Toolkit [Getting Started Guide for Linux](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) for installation steps and scripts.
-
- ### Activate conda environment With Root Access <a name="activate-conda"></a>
-
- In the Linux shell, navigate to your oneapi installation path, typically `/opt/intel/oneapi/` when installed as root or sudo, and `~/intel/oneapi/` when not installed as a super user.
-
- Activate the conda environment with the following command:
- ### Activate conda environment Without Root Access (Optional)
-
- By default, the Intel oneAPI AI Analytics toolkit is installed in the `oneapi` folder, which requires root privileges to manage it. If you would like to bypass using root access to manage your conda environment, then you can clone your desired conda environment using the following command:
+ ### Activate conda environment

+ To install the Intel Distribution of Modin python environment, use the following command:
Then activate your conda environment with the following command:
```
+ conda activate aikit-modin
```

+ Additionally, install the following in the conda environment

- ### Install Jupyter Notebook*
-
- Launch Jupyter Notebook in the directory housing the code example.
-
+ ### Install Jupyter Notebook
+ Needed to launch Jupyter Notebook in the directory housing the code example
```
conda install jupyter nb_conda_kernels
- @@ -76,7 +80,7 @@ pip install jupyter
-
- ### Install wget package
-
- Install wget package to retrieve the Census dataset using HTTPS.
-
```
- pip install wget
- @@ -85,7 +89,7 @@ pip install wget
- #### View in Jupyter Notebook

+ ### ray-dashboard and opencensus
```
+ conda install ray-dashboard
+ pip install opencensus
```

- Launch Jupyter Notebook in the directory housing the code example.
-
+ #### View in Jupyter Notebook
+ Launch Jupyter Notebook in the directory housing the code example
```
jupyter notebook
- @@ -112,3 +116,16 @@ Run the Program
```
+ ## Running the end-to-end code sample
+ ### Run as Jupyter Notebook
+ Open .ipynb file and run cells in Jupyter Notebook using the "Run" button. Alternatively, the entire workbook can be run using the "Restart kernel and re-run whole notebook" button. (see image below using "census modin" sample)
+ [image]
+
+ ### Run as Python File
+ Open notebook in Jupyter and download as python file (see image using "census modin" sample)
+ [image]
+ Run the Program
+ `python census_modin.py`
##### Expected Printed Output:
Expected Cell Output shown for census_modin.ipynb:
[image]
-
-
- ### Request a Compute Node
- In order to run on the DevCloud, you need to request a compute node using node properties such as: `gpu`, `xeon`, `fpga_compile`, `fpga_runtime` and others. For more information about the node properties, execute the `pbsnodes` command.
- This node information must be provided when submitting a job to run your sample in batch mode using the qsub command. When you see the qsub command in the Run section of the [Hello World instructions](https://devcloud.intel.com/oneapi/get_started/aiAnalyticsToolkitSamples/), change the command to fit the node you are using. Nodes which are in bold indicate they are compatible with this sample:

- "# Census with Modin and Intel® Data Analytics and Acceleration Library (DAAL) Accelerated Scikit-Learn"
+ "# End-to-end Census workload with Intel® Distribution of Modin and Intel® Extension for Scikit-learn"
]
},
{
@@ -28,9 +28,8 @@
}
},
"source": [
- "In this example we will be building an end to end machine learning workload with US census from 1970 to 2010.\n",
- "It uses Modin with Ray as backend compute engine for ETL, and uses Ridge Regression from DAAL accelerated scikit-learn library\n",
- "to train and predict the US total income with education information."
+ "In this example we will be running an end-to-end machine learning workload with US census data from 1970 to 2010.\n",
+ "It uses Intel® Distribution of Modin with Ray as backend compute engine for ETL, and uses Ridge Regression algorithm from Intel scikit-learn-extension library to train and predict the co-relation between US total income and education levels."
"Import Modin and set Ray as the compute engine. This engine uses analytical database OmniSciDB to obtain high single-node scalability for specific set of dataframe operations. "
- "Load DAAL accelerated sklearn patch and import packages from the patch"
+ "Import Intel(R) Extension for Scikit-learn which dynamically patches scikit-learn estimators to use Intel(R) oneAPI Data Analytics Library as the underlying solver, while getting the same solution faster."
"Read and load the data into a dataframe from the downloaded archive file"
]
},
{
@@ -166,7 +159,7 @@
}
},
"source": [
- "ETL"
+ "Run ETL operations to prepare and transform the ingested dataset into a form that can be readily consumed by the ridge regression algorithm. Keep columns that are relevant, clean up the samples with invalid income, education and normalize the income to account for yearly inflation"
]
},
{
@@ -209,7 +202,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Train the model and predict the income"
+ "Train the model and run prediction. Loop 50 times to remove any bias in splitting the dataset into train & test set, in order to reduce chance of over-fitting from selecting a train set that fits the model too well to the test set"
]
},
{
@@ -254,7 +247,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Check the regression results: mean squared error and r square score"
+ "Check the regression results by calculating the accuracy of the prediction using mean squared error and r square score"
AI-and-Analytics/End-to-end-Workloads/Census/sample.json (5 additions & 2 deletions)
@@ -2,7 +2,7 @@
"guid": "AA055D7B-290C-4FCA-990B-B9FC88AF18D4",
"name": "Census",
"categories": ["Toolkit/oneAPI AI And Analytics/End-to-End Workloads"],
- "description": "This sample illustrates using Modin and daal optimized scikit-learn to build and run an end-to-end machine learning workload",
+ "description": "This sample illustrates the use of Intel Distribution of Modin and Intel Extension for Scikit-learn to build and run an end-to-end machine learning workload",