| 1 | +# End-to-end machine learning workload: Census |
| 2 | +This code sample shows how to use Modin for ETL operations and the ridge regression algorithm from the DAAL-accelerated scikit-learn library to build and run an end-to-end machine learning workload. It demonstrates software products available in the [Intel AI Analytics Toolkit powered by oneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html). |
| 3 | + |
| 4 | +| Optimized for | Description |
| 5 | +| :--- | :--- |
| 6 | +| OS | 64-bit Linux: Ubuntu 18.04 or higher |
| 7 | +| Hardware | Intel Atom® Processors; Intel® Core™ Processor Family; Intel® Xeon® Processor Family; Intel® Xeon® Scalable Performance Processor Family |
| 8 | +| Software | Python version 3.7, Modin, Ray, daal4py, Scikit-Learn, NumPy, Intel® AI Analytics Toolkit |
| 9 | +| What you will learn | How to use Modin and DAAL-optimized scikit-learn (developed and owned by Intel) to build end-to-end ML workloads and improve performance. |
| 10 | +| Time to complete | 15-18 minutes |
| 11 | + |
| 12 | +## Purpose |
| 13 | +Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin integrates seamlessly with existing pandas code. daal4py is a simplified Python API to Intel® DAAL that gives data scientists and machine learning users fast access to the framework. It provides an abstraction over Intel® DAAL for either direct use or integration into your own framework. |
| 14 | + |
| 15 | +#### Model and dataset |
| 16 | +In this sample, you will use Modin to ingest and process U.S. census data from 1970 to 2010 and build a ridge regression model that captures the relationship between education and total income earned in the US. |
| 17 | +The data transformation stage adjusts income for yearly inflation, balances the data so that each year has a similar number of data points, and extracts the features from the transformed dataset. The feature vectors are fed into the ridge regression model to predict the income of each sample. |
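
The three stages described above (inflation adjustment, per-year balancing, ridge regression) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's Modin or daal4py code: the records, the `INFLATION_FACTOR` values, and the helper names `adjust_for_inflation`, `balance_by_year`, and `fit_ridge_1d` are all hypothetical, and the model uses a single feature with a closed-form solution so that it runs without the toolkit installed.

```python
import random
from collections import defaultdict

# Hypothetical records: (year, years_of_education, nominal_income).
records = [
    (1970, 10, 6000.0), (1970, 12, 7200.0), (1970, 16, 9600.0),
    (1990, 10, 15000.0), (1990, 12, 18000.0), (1990, 16, 24000.0),
    (1990, 18, 27000.0),
    (2010, 10, 30000.0), (2010, 12, 36000.0), (2010, 14, 42000.0),
    (2010, 16, 48000.0), (2010, 18, 54000.0),
]

# Hypothetical multipliers that express each year's income in 2010 dollars.
INFLATION_FACTOR = {1970: 5.0, 1990: 2.0, 2010: 1.0}


def adjust_for_inflation(rows, factors):
    """Scale each nominal income by the factor for its year."""
    return [(year, edu, income * factors[year]) for year, edu, income in rows]


def balance_by_year(rows, seed=0):
    """Downsample every year to the size of the smallest year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row[0]].append(row)
    target = min(len(group) for group in by_year.values())
    rng = random.Random(seed)
    balanced = []
    for year in sorted(by_year):
        balanced.extend(rng.sample(by_year[year], target))
    return balanced


def fit_ridge_1d(xs, ys, alpha=1.0):
    """One-feature ridge regression with an unpenalized intercept.

    Minimizes sum((y - b0 - b1 * x) ** 2) + alpha * b1 ** 2.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return slope, intercept


adjusted = adjust_for_inflation(records, INFLATION_FACTOR)
balanced = balance_by_year(adjusted)
features = [edu for _, edu, _ in balanced]
targets = [income for _, _, income in balanced]
slope, intercept = fit_ridge_1d(features, targets, alpha=1.0)
print(f"income ~ {intercept:.1f} + {slope:.1f} * years_of_education")
```

The `alpha` penalty shrinks the slope toward zero, so setting `alpha=0.0` reduces the fit to ordinary least squares. In the actual sample, the same stages operate on DataFrames and on many features at once.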
| 18 | + |
| 19 | +The dataset is from IPUMS USA, University of Minnesota, [www.ipums.org](https://ipums.org/) (Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0) |
| 20 | + |
| 21 | +## Key Implementation Details |
| 22 | +This end-to-end workload sample is implemented for CPU in Python. The example requires Modin, Ray, daal4py, Scikit-Learn, and NumPy to be installed in a conda environment, as described in the [Intel AI Analytics Toolkit powered by oneAPI](https://software.intel.com/content/www/us/en/develop/articles/installing-ai-kit-with-conda.html) conda installation guide and in the steps that follow in this README. |
| 23 | + |
| 24 | +## License |
| 25 | + |
| 26 | +This code sample is licensed under the MIT license. |
| 27 | + |
| 28 | +## Building Modin and daal4py for CPU to build and run the end-to-end workload |
| 29 | + |
| 30 | +Modin and the oneAPI Data Analytics Library (DAAL) are ready for use once you finish the Intel AI Analytics Toolkit installation with the Conda Package Manager. |
| 31 | + |
| 32 | +You can refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi), and the Toolkit [Getting Started Guide for Linux](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) for installation steps and scripts. |
| 33 | + |
| 34 | + |
| 35 | +### Activate conda environment With Root Access |
| 36 | + |
| 37 | +Follow the steps in the Getting Started Guide (linked above) to set up your oneAPI environment with the `setvars.sh` script, and install the Intel Distribution of Modin environment (https://software.intel.com/content/www/us/en/develop/articles/installing-ai-kit-with-conda.html). Then, in a Linux shell, navigate to your oneAPI installation path, typically `/opt/intel/oneapi/` when installed as root or with sudo, and `~/intel/oneapi/` when installed without superuser privileges. If you customized the installation folder, the `setvars.sh` file is in your custom folder. |
| 38 | + |
| 39 | +Activate the conda environment with the following command: |
| 40 | + |
| 41 | +#### Linux |
| 42 | +``` |
| 43 | +source activate intel-aikit-modin |
| 44 | +``` |
| 45 | + |
| 46 | +### Activate conda environment Without Root Access (Optional) |
| 47 | + |
| 48 | +By default, the Intel AI Analytics Toolkit is installed in the `oneapi` folder, which requires root privileges to manage. To manage your conda environment without root access, create your own copy of the desired conda environment with the following command: |
| 49 | + |
| 50 | +#### Linux |
| 51 | +``` |
| 52 | +conda create --name intel-aikit-modin -c intel/label/oneapibeta -c intel -c conda-forge runipy intel-aikit-modin=2021.1b10 |
| 53 | +``` |
| 54 | + |
| 55 | +Then activate your conda environment with the following command: |
| 56 | + |
| 57 | +``` |
| 58 | +conda activate intel-aikit-modin |
| 59 | +``` |
| 60 | + |
| 61 | + |
| 62 | +### Install Jupyter Notebook |
| 63 | + |
| 64 | +Install Jupyter Notebook in the activated conda environment using one of the following commands: |
| 65 | + |
| 66 | +``` |
| 67 | +conda install jupyter nb_conda_kernels |
| 68 | +``` |
| 69 | +or |
| 70 | +``` |
| 71 | +pip install jupyter |
| 72 | +``` |
| 73 | + |
| 74 | +### Install wget package |
| 75 | + |
| 76 | +Install the wget package to retrieve the Census dataset over HTTPS: |
| 77 | + |
| 78 | +``` |
| 79 | +pip install wget |
| 80 | +``` |
| 81 | + |
| 82 | +#### View in Jupyter Notebook |
| 83 | + |
| 84 | + |
| 85 | +Launch Jupyter Notebook in the directory that contains the code sample: |
| 86 | + |
| 87 | +``` |
| 88 | +jupyter notebook |
| 89 | +``` |
| 90 | + |
| 91 | +## Running the end-to-end code sample |
| 92 | + |
| 93 | +### Run as Jupyter Notebook |
| 94 | + |
| 95 | +Open the .ipynb file and run the cells in Jupyter Notebook using the "Run" button. Alternatively, run the entire notebook using the "Restart kernel and re-run whole notebook" button (see the image below, which uses the "census modin" sample). |
| 96 | + |
| 97 | + |
| 98 | + |
| 99 | +### Run as Python File |
| 100 | + |
| 101 | +Open the notebook in Jupyter and download it as a Python file (see the image, which uses the "census modin" sample). |
| 102 | + |
| 103 | + |
| 104 | + |
| 105 | +Run the program: |
| 106 | + |
| 107 | +`python census_modin.py` |
| 108 | + |
| 109 | +##### Expected Printed Output: |
| 110 | +Expected Cell Output shown for census_modin.ipynb: |
| 111 | + |