
Commit 3847979

Sample code for the end-to-end workload (Census) added to the fork. This includes the jupyter notebook file, license.txt, sample.json and README (oneapi-src#305)
Signed-off-by: Vrushabh Sanghavi <[email protected]> Co-authored-by: Vrushabh Sanghavi <[email protected]>
1 parent 601cf88 commit 3847979

7 files changed

+527
-0
lines changed

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
Copyright Intel Corporation
2+
3+
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4+
5+
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6+
7+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
1+
# End-to-end machine learning workload: Census
2+
This sample illustrates how to use Modin for ETL operations and the ridge regression algorithm from the DAAL-accelerated scikit-learn library to build and run an end-to-end machine learning workload. It demonstrates software products found in the [Intel AI Analytics Toolkit powered by oneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html).
3+
4+
| Optimized for | Description
| :--- | :---
| OS | 64-bit Linux: Ubuntu 18.04 or higher
| Hardware | Intel Atom® Processors; Intel® Core™ Processor Family; Intel® Xeon® Processor Family; Intel® Xeon® Scalable Performance Processor Family
| Software | Python version 3.7, Modin, Ray, daal4py, Scikit-Learn, NumPy, Intel® AI Analytics Toolkit
| What you will learn | How to use Modin and DAAL-optimized scikit-learn (developed and owned by Intel) to build end-to-end ML workloads and gain performance.
| Time to complete | 15-18 minutes
11+
12+
## Purpose
13+
Modin uses Ray to speed up your pandas notebooks, scripts, and libraries with little effort. Unlike other distributed DataFrame libraries, Modin integrates with, and stays compatible with, existing pandas code. daal4py is a simplified API to Intel® DAAL, aimed at data scientists and machine learning users who want to use the framework quickly. It provides an abstraction over Intel® DAAL for either direct use or integration into your own framework.
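The compatibility claim can be illustrated with an import swap. The sketch below is not taken from the sample notebook: `load_dataframe_module` is a hypothetical helper, and the two-row table is made-up data. It assumes Modin exposes its pandas-compatible API as `modin.pandas`, and it falls back to plain pandas (or to a message) when Modin is not installed.

```python
import importlib
import importlib.util


def load_dataframe_module():
    """Return modin.pandas if Modin is installed, else pandas, else None."""
    # find_spec returns None for a top-level package that is not installed.
    if importlib.util.find_spec("modin") is not None:
        return importlib.import_module("modin.pandas")
    if importlib.util.find_spec("pandas") is not None:
        return importlib.import_module("pandas")
    return None


pd = load_dataframe_module()
if pd is None:
    print("Neither Modin nor pandas is installed in this environment.")
else:
    # Made-up values; the same DataFrame code runs under either module.
    df = pd.DataFrame({"year": [1970, 1980], "income": [9000.0, 21000.0]})
    print(pd.__name__, df["income"].mean())
```

In a script that already uses pandas, the equivalent change is usually just replacing `import pandas as pd` with `import modin.pandas as pd`.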
14+
15+
#### Model and dataset
16+
In this sample, you will use Modin to ingest and process U.S. census data from 1970 to 2010, then build a ridge regression model to find the relationship between education and total income earned in the U.S.
17+
The data transformation stage normalizes income for yearly inflation, balances the data so that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the income of each sample.
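The three transformation steps and the model fit can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the notebook's implementation. The price-index values and the sample rows are made up. Balancing is done by downsampling every year to the size of the smallest year, which is only one possible strategy. The ridge fit uses a single feature with an unpenalized intercept, whereas the sample itself uses Modin and the DAAL-accelerated scikit-learn ridge regression.

```python
import random

# Hypothetical price-index values for illustration; NOT real CPI figures.
PRICE_INDEX = {1970: 40.0, 1990: 130.0, 2010: 220.0}
BASE_YEAR = 2010


def adjust_for_inflation(income, year):
    """Express an income earned in `year` in BASE_YEAR currency."""
    return income * PRICE_INDEX[BASE_YEAR] / PRICE_INDEX[year]


def balance_by_year(rows, seed=0):
    """Downsample so every year keeps as many rows as the smallest year."""
    rng = random.Random(seed)
    by_year = {}
    for row in rows:
        by_year.setdefault(row["year"], []).append(row)
    target = min(len(group) for group in by_year.values())
    balanced = []
    for year in sorted(by_year):
        balanced.extend(rng.sample(by_year[year], target))
    return balanced


def fit_ridge_1d(xs, ys, alpha=1.0):
    """Single-feature ridge regression with an unpenalized intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up rows: year, years of education, nominal income.
rows = [
    {"year": 1970, "education": 10, "income": 6000.0},
    {"year": 1970, "education": 16, "income": 11000.0},
    {"year": 1990, "education": 12, "income": 24000.0},
    {"year": 1990, "education": 14, "income": 30000.0},
    {"year": 1990, "education": 18, "income": 52000.0},
    {"year": 2010, "education": 12, "income": 35000.0},
    {"year": 2010, "education": 16, "income": 61000.0},
]
for row in rows:
    row["income"] = adjust_for_inflation(row["income"], row["year"])

balanced = balance_by_year(rows)
features = [row["education"] for row in balanced]
targets = [row["income"] for row in balanced]
slope, intercept = fit_ridge_1d(features, targets, alpha=1.0)
print("rows after balancing:", len(balanced))
print("slope:", slope, "intercept:", intercept)
```

Setting `alpha=0` reduces the fit to ordinary least squares; larger values of `alpha` pull the slope toward zero.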
18+
19+
The dataset is from IPUMS USA, University of Minnesota, [www.ipums.org](https://ipums.org/) (Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0)
20+
21+
## Key Implementation Details
22+
This end-to-end workload sample is implemented for CPU in Python. It requires Modin, Ray, daal4py, Scikit-Learn, and NumPy installed inside a conda environment, as described in the installation article for the [Intel AI Analytics Toolkit powered by oneAPI](https://software.intel.com/content/www/us/en/develop/articles/installing-ai-kit-with-conda.html) and in the steps that follow in this README.
23+
24+
## License
25+
26+
This code sample is licensed under the MIT license.
27+
28+
## Building Modin and daal4py for CPU to build and run end-to-end workload
29+
30+
Modin and the oneAPI Data Analytics Library (DAAL) are ready for use once you finish installing the Intel AI Analytics Toolkit with the Conda package manager.
31+
32+
You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi), and the Toolkit [Getting Started Guide for Linux](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) for installation steps and scripts.
33+
34+
35+
### Activate conda environment With Root Access
36+
37+
Follow the Getting Started Guide steps (above) to set up your oneAPI environment with the `setvars.sh` script, and install the Intel Distribution of Modin environment (https://software.intel.com/content/www/us/en/develop/articles/installing-ai-kit-with-conda.html). Then, in a Linux shell, navigate to your oneAPI installation path, typically `/opt/intel/oneapi/` when installed as root or with sudo, and `~/intel/oneapi/` when installed without superuser privileges. If you customized the installation folder, the `setvars.sh` file is in your custom folder.
38+
39+
Activate the conda environment with the following command:
40+
41+
#### Linux
42+
```
source activate intel-aikit-modin
```
45+
46+
### Activate conda environment Without Root Access (Optional)
47+
48+
By default, the Intel AI Analytics Toolkit is installed in the `oneapi` folder, which requires root privileges to manage. To manage your conda environment without root access, you can create your own copy of the desired conda environment using the following command:
49+
50+
#### Linux
51+
```
conda create --name intel-aikit-modin -c intel/label/oneapibeta -c intel -c conda-forge runipy intel-aikit-modin=2021.1b10
```
54+
55+
Then activate your conda environment with the following command:
56+
57+
```
conda activate intel-aikit-modin
```
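Once the environment is active, a quick check can confirm that the packages listed in the requirements table are importable. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and `sklearn` is the import name of Scikit-Learn.

```python
import importlib.util

# Import names for the packages this README lists as requirements.
REQUIRED = ("modin", "ray", "daal4py", "sklearn", "numpy")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only locates the packages; it does not import them or verify their versions.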
60+
61+
62+
### Install Jupyter Notebook
63+
64+
Install Jupyter Notebook with either conda or pip:
65+
66+
```
conda install jupyter nb_conda_kernels
```
69+
or
70+
```
pip install jupyter
```
73+
74+
### Install wget package
75+
76+
Install the `wget` package to retrieve the Census dataset over HTTPS:
77+
78+
```
pip install wget
```
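As a rough sketch of how the `wget` package might be used from Python, the helper below downloads a file only when it is not already present. `fetch_if_missing` is a hypothetical name, and the URL and file name in the usage comment are placeholders; this README does not state the dataset's actual location.

```python
import os


def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless a file already exists at `dest`."""
    if os.path.exists(dest):
        return dest
    # Imported lazily so the function can be defined without the package.
    import wget  # third-party package installed with `pip install wget`
    return wget.download(url, out=dest)


# Placeholder usage; substitute the dataset URL used by the notebook.
# path = fetch_if_missing("https://example.com/census-data.csv.gz", "census-data.csv.gz")
```

Skipping the download when the file exists avoids fetching the dataset again on every run.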
81+
82+
#### View in Jupyter Notebook
83+
84+
85+
Launch Jupyter Notebook in the directory that contains the code example:
86+
87+
```
jupyter notebook
```
90+
91+
## Running the end-to-end code sample
92+
93+
### Run as Jupyter Notebook
94+
95+
Open the .ipynb file and run its cells in Jupyter Notebook using the "Run" button. Alternatively, run the entire notebook using the "Restart kernel and re-run whole notebook" button (see the image below, which uses the "census modin" sample).
96+
97+
![Click the Run Button in the Jupyter Notebook](Running_Jupyter_notebook.jpg "Run Button on Jupyter Notebook")
98+
99+
### Run as Python File
100+
101+
Open the notebook in Jupyter and download it as a Python file (see the image below, which uses the "census modin" sample).
102+
103+
![Download as python file in the Jupyter Notebook](Running_Jupyter_notebook_as_Python.jpg "Download as python file in the Jupyter Notebook")
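If the Jupyter menu is not available, the code cells can also be extracted with a few lines of Python. This is a simplified stand-in for the menu export, not the official converter: `notebook_to_script` is a hypothetical helper, and it assumes the notebook file is named `census_modin.ipynb` and follows the standard notebook JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). IPython magics or shell escapes in the cells would need to be removed by hand.

```python
import json
import os


def notebook_to_script(nb_path, py_path):
    """Write the code cells of a notebook to a Python file; return their count."""
    with open(nb_path, encoding="utf-8") as f:
        notebook = json.load(f)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The source is stored either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    with open(py_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(chunks) + "\n")
    return len(chunks)


# Convert only if the notebook is present in the current directory.
if os.path.exists("census_modin.ipynb"):
    count = notebook_to_script("census_modin.ipynb", "census_modin.py")
    print("Wrote", count, "code cells to census_modin.py")
```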
104+
105+
Run the program:
106+
107+
`python census_modin.py`
108+
109+
##### Expected Printed Output:
110+
Expected Cell Output shown for census_modin.ipynb:
111+
![Output](Expected_output.jpg "Expected output for Jupyter Notebook")