# AzureML and Data Science Overview Workshop

The purpose of this workshop is for you to work through a basic end-to-end flow for a data scientist getting started with AzureML: train a model, test it with an endpoint, then experiment to improve the outcomes. Work through the scenario with the Studio, the CLI v2, and the SDK where appropriate. Use our docs and samples to figure out syntax. The scenario has been tested, and you'll see that the product is pretty good overall, though some things are inconsistent and some experiences aren't good. With the exception of some internal private preview features, this is what our customers see.

## 0. Getting Started

You need access to an Azure subscription. If it's a shared subscription, the subscription owner should provide you with a resource group and assign you as its owner so you can wear an IT hat and create a workspace and the required resources.
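
If you want to script this step, a minimal sketch with the Python SDK v2 might look like the following. The IDs and names are placeholders, and note that creating a workspace also provisions its dependent resources (storage account, key vault, and so on):

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

# Connect at the resource-group scope (placeholders to fill in)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
)

# Create the workspace; dependent resources are provisioned automatically
ws = Workspace(name="<workspace-name>", location="<region>")
ml_client.workspaces.begin_create(ws).result()
```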

This repo [repo link] has the Python scripts and data files you need to get started.

Hint: You need AzureML compute quota in the workspace's region for the VM family you want to use. 4 cores of DSv2 should be fine.

## 1. Train and Test Locally

The first step is to train a model locally to make sure it works. You can do this with a Compute Instance, a DSVM, or your local machine. Linux is easier to set up than Windows. You can work from Visual Studio Code, a terminal window, or a notebook.

Once you've trained a model, try the score.py script to test it using a subset of the training data.

Hints: For local training, the train.py script is set up to use the training data CSV file in the same directory. It writes the model file to a new folder, deleting an existing folder with the same name if found. The script uses MLflow logging and will log to an AzureML workspace without having to submit a run. You may need to configure your Python environment with the packages needed for the training script.
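
For orientation, here is a minimal sketch of what a local training script with the behavior described in the hints might look like. The file name, label column, and model type (LightGBM, per section 6) are assumptions, not the repo's actual code:

```python
# Illustrative sketch only -- see the repo's train.py for the real script.
import shutil
from pathlib import Path

import mlflow
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.autolog()  # logs params and metrics; can target an AzureML workspace via the tracking URI

df = pd.read_csv("flight-delays.csv")               # training CSV in the same directory (assumed name)
X, y = df.drop(columns=["delayed"]), df["delayed"]  # assumed label column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LGBMClassifier().fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))

out_dir = Path("model")
shutil.rmtree(out_dir, ignore_errors=True)          # delete an existing model folder, as the hint notes
mlflow.sklearn.save_model(model, str(out_dir))      # write the model to a new folder
```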

## 2. Train in the Cloud

Now that you've confirmed the training code works, train the model in the cloud using an AzureML job. Try this from the Studio and the v2 CLI. Use the train.py and training data CSV from the repo (but imagine that you're training on petabytes of data).
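
The scenario suggests the Studio and the v2 CLI; if you also want to try the Python SDK v2, a minimal sketch might look like this. The compute cluster, environment, and paths are assumptions you would replace with your own:

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the workspace (fill in your own IDs)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Submit train.py as a command job on an assumed compute cluster
job = command(
    code="./src",  # folder containing train.py and the training CSV from the repo
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # assumed curated environment
    compute="cpu-cluster",  # assumed compute target
    experiment_name="flight-delay-training",
)
returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # open the run in the Studio to watch metrics
```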

## 3. Create a Managed Real-Time Endpoint

After training a model, create a managed real-time endpoint and test it using the sample JSON file in the repo. Try this with the Studio and the v2 CLI.
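
The SDK can do the same thing. A rough sketch, reusing the ml_client from the previous step and assuming the model was saved in MLflow format (endpoint and deployment names are made up):

```python
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model

# Create the endpoint (assumed name)
endpoint = ManagedOnlineEndpoint(name="flight-delay-ep", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy the saved MLflow model; MLflow models don't need a scoring script
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="flight-delay-ep",
    model=Model(path="./model", type="mlflow_model"),
    instance_type="Standard_DS2_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Send all traffic to the new deployment, then test with the sample JSON from the repo
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print(ml_client.online_endpoints.invoke(
    endpoint_name="flight-delay-ep", request_file="sample-request.json"))
```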

You could also write a script or app to use the endpoint.
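
For example, a small client script. The scoring URI and key are placeholders you would copy from the endpoint's Consume tab in the Studio:

```python
import json

import requests

SCORING_URI = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
API_KEY = "<endpoint-key>"  # placeholder

# Load the sample JSON file from the repo as the request payload
with open("sample-request.json") as f:
    payload = json.load(f)

response = requests.post(
    SCORING_URI,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(response.json())
```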

## 4. Create a Managed Batch Endpoint

Next, create a batch endpoint. There is a scoring CSV in the repo. If you do this in the Studio, you will see that a pipeline was created. Look at the results and see how many flights are predicted to be on-time vs. delayed.
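
A sketch of the equivalent SDK calls, again reusing the ml_client from earlier and with assumed names:

```python
from azure.ai.ml import Input
from azure.ai.ml.entities import BatchDeployment, BatchEndpoint, Model

# Create the batch endpoint (assumed name)
batch_endpoint = BatchEndpoint(name="flight-delay-batch")
ml_client.batch_endpoints.begin_create_or_update(batch_endpoint).result()

# Deploy the MLflow model against an assumed compute cluster
batch_deployment = BatchDeployment(
    name="default",
    endpoint_name="flight-delay-batch",
    model=Model(path="./model", type="mlflow_model"),
    compute="cpu-cluster",  # assumed compute target
    instance_count=1,
    mini_batch_size=10,
    output_file_name="predictions.csv",
)
ml_client.batch_deployments.begin_create_or_update(batch_deployment).result()

# Score the CSV from the repo; this kicks off the pipeline job you can watch in the Studio
scoring_job = ml_client.batch_endpoints.invoke(
    endpoint_name="flight-delay-batch",
    deployment_name="default",
    input=Input(type="uri_file", path="./scoring-data.csv"),  # assumed file name
)
```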

## 5. Experiment

Now we get to the science part: experimenting with how we could improve the model. Edit the train.py file and change [param] to True.

Train the model again and compare its metrics with the first model's. What do you see?
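
Since the script logs with MLflow, one way to compare the runs side by side is a quick query like the sketch below. The experiment name and metric column are assumptions:

```python
import mlflow

# Assumes the MLflow tracking URI points at the AzureML workspace
runs = mlflow.search_runs(experiment_names=["flight-delay-training"])
print(runs[["run_id", "start_time", "metrics.accuracy"]]
      .sort_values("start_time", ascending=False))
```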

Test the new model by creating a new batch endpoint or scoring locally. Are the results different? Which metrics do you think are important for measuring the quality of this prediction?

## 6. Explore the Data

The data imbalance hyperparameter in LightGBM helps when the training data contains more on-time flights than delayed ones, which biases the model's outcome. This is something a data scientist needs to be aware of.

Try different ways to explore the data. Open it in Excel, import it as a tabular dataset, or use Python tools. Think about how we can help with this.
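
For instance, a quick class-balance check with pandas (the file and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("flight-delays.csv")  # assumed file name
print(df["delayed"].value_counts(normalize=True))  # assumed label column; shows the class imbalance
print(df.describe(include="all"))  # quick summary of every column
```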

## 7. Reflect and Discuss

This was a simple exercise, but it used the breadth of AzureML for ML pros working with Python scripts in the Studio, the CLI, VS Code, or other tools. You had to create a basic workspace (no VNET today), create and use computes, train models, look at metrics, create and test endpoints, and experiment with the training code and data.

- Was it easy to get started and figure out what to do?
- Did you need to look at documentation or samples?
- Were things consistent across the Studio, the v2 CLI, and the SDK?
- Did you have to troubleshoot any error messages?
- Was the Studio experience intuitive?
- What are you personally going to improve based on what you experienced in this workshop?

## 8. Bonus: Improve the Model

Now think of this as a Kaggle competition. Have some fun and try out different ways to improve the model quality. You could look at AutoML, FLAML, or other approaches. Share your best model with the team.
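
As one possible starting point, FLAML can search model and hyperparameter choices automatically. A sketch, reusing the training split from the local training sketch and an assumed metric:

```python
from flaml import AutoML

automl = AutoML()
# X_tr, y_tr as in the local training sketch; time_budget is in seconds
automl.fit(X_train=X_tr, y_train=y_tr, task="classification",
           time_budget=120, metric="roc_auc")
print(automl.best_estimator, automl.best_config)
```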