A Guideline to Production Level Deep Learning [Under developement]

Deploying deep learning models in production could be challenging, as it's far beyond just training models with good perfromance. As you can see in the following figure, there are several components that need to be properly designed and developed in order to deploy a production level deep learning system:

This repo aims to serve as a an engineering guideline for building production-level deep learning systems to be deployed in real world applications.

(The material presented here is moslty borrowed from Full Stack Deep Learning Bootcamp (by Pieter Abbeel, Josh Tobin, and Sergey Karayev), TFX workshop by Robert Crowe, and Pipeline.ai's Advanced KubeFlow Meetup by Chris Fregly. )

The following figure represent a high level overview of different components in a production level deep learning system:

In the following, we will go through each module and recommend toolsets and frameworks as well as best practices from practioners that fit each component.

1. Data Management

1.1. Data Sources

Open source data (good to start with, not an advantage)
Data augmentation
Synthetic data

1.2. Labeling

Sources of labor for labeling:
- Crowdsourcing
- Service companies
  - FigureEight
- Hiring annotators
Labeling platforms:
- Prodigy: An annotation tool powered by active learning (by developers of Spacy), text and image
- HIVE: AI as a Service platform for computer vision
- Supervisely: entire computer vision platform
- Labelbox: computer vision
- Scale AI data platform (computer vision & NLP)

1.3. Storage

Data storage options:
- Object store: Store binary data (images, sound files, compressed texts)
  - Aamzon S3
  - Ceph Object Store
- Database: Store metadata (file paths, labels, user activity, etc).
  - Postgres is the right choice for most of applications, with the best-in-class SQL and great support for unstructured JSON.
- Data Lake: to aggregate features which are not obtainable from database (e.g. logs)
  - Amazon Redshift
- Feature Store: storage and access of machine learning features
  - FEAST (Google cloud, Open Source)
  - Michelangelo (Uber)
At train time: copy data into a local or networked filesystem

1.4. Versioning

DVC: Open source version control system for ML projects
Pachyderm: version control for data
Dolt: versioning for SQL database

1.5. Processing

Training data may come from different sources: Stored data in db and object stores, log processing, outputs of other classifiers.
There are dependencies between tasks, one needs to kick off after its dependencies are finished.

Workflows:
- Airflow (most commonly used)

2. Development, Training, and Evaluation

2.1. Software engineering

Editors:
- Vim
- VS Code (Recommended by the author)
  - Built in git staging and diff, Lint code, open projects remotely through ssh
- Jupyter Notebooks: Great as starting point of the projects, hard to scale
- Streamlit: interactive data science tool with applets
Compute recommendations ¹:
- For solo/startup:
  - Development: a 4x Turing-architecture PC
  - Training/Evaluation: Use the same 4x GPU PC. When running many experiments, either buy shared servers or use cloud instances.
- For larger companies:
  - Development: Buy a 4x Turing-architecture PC per ML scientist or let them use V100 instances
  - Training/Evaluation: Use cloud instances with proper provisioning and handling of failures

2.2. Resource Management

Allocating free resources to programs
Resource management options:
- Old school cluster job scheduler ( e.g. Slurm workload manager )
- Docker + Kubernetes
- Kubeflow
- Polyaxon (paid features)

2.3. DL Frameworks

Unless having a good reason not to, use Tensorflow/Keras or PyTorch ¹

2.4. Experiment management

Tensorboard
Losswise (Monitoring for ML)
Comet.ml
Weights & Biases
MLFlow tracking

2.5. Hyperparameter Tuning

Hyperas
SIGOPT
Ray - Tune
Weights & Biases

2.6. Distributed Training

Data parallelism: Use it when iteration time is too long (both tensorflow and PyTorch support)
Model parallelism: when model does not fit on a single GPU
Other solutions:
- Ray
- Horovod

3. Troubleshooting [TBD]

4. Testing and Deployment

4.1. Testing and CI/CD

Machine Learning production software requires a more diverse set of test suites than traditional software:

Unit and Integration Testing:
- Types of test:
  - Training system tests: testing training pipeline
  - Validation tests: testing prediction system on validation set
  - Functionality tests: testing prediction system on few important examples
Continuous Integration: Running tests after each new code change pushed to the repo
SaaS for continuous integration:
- CircleCI, Travis
- Jenkins, Buildkite

4.2. Web Depolyment

Consists of a Prediction System and a Serving System
- Prediction System: Process input data, make predictions
- Serving System (Web server):
  - Serve prediction with scale in mind
  - Use REST API to serve prediction HTTP requests
  - Calls the prediction system to respond
Serving options:
- 1. Deploy to VMs, scale by adding instances
- 1. Deploy as containers, scale via orchestration
  - Containers
    - Docker
  - Container Orchestration:
    - Kubernetes (the most popular now)
    - MESOS
    - Marathon
- 1. Deploy code as a "serverless function"
- 1. Deploy via a model serving solution
Model serving:
- Specialized web deployment for ML models
- Batches request for GPU inference
- Frameworks:
  - Tensorflow serving
  - MXNet Model server
  - Clipper (Berkeley)
  - SaaS solutions (Seldon, Algorithma)
Decision making:
- CPU inference:
  - CPU inference is preferable if it meets the requirements.
  - Scale by adding more servers, or going serverless.
- GPU inference:
  - TF serving or Clipper
  - Adaptive batching is useful

4.4. Monitoring:

Purpose:
- Alerts for downtime, errors, and distribution shifts
- Catching service and data regressions
Cloud providers solutions are decent

4.5. Deploying on Embedded and Mobile Devices

Main challenge: memory footprint and compute constraints
Solutions:
- Quantization
- Reduced model size
  - MobileNets
- Knowledge Distillation
  - DistillBERT (for NLP)
Embedded and Mobile Frameworks:
- Tensorflow Lite
- PyTorch Mobile
- Core ML
- ML Kit
- FRITZ
Model Conversion:
- Open Neural Network Exchange (ONNX): open-source format for deep learning models

4.6. All-in-one solutions

Tensorflow Extended (TFX)
Michelangelo (Uber)
Google Cloud AI Platform
Amazon SageMaker
Neptune
FLOYD
Paperspace
Determined AI
Domino data lab

Contributing

References:

[1]: Full Stack Deep Learning Bootcamp

[2]: Advanced KubeFlow Workshop by Pipeline.ai

[3]: TFX: Real World Machine Learning in Production

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
Airflow		Airflow
Kubeflow		Kubeflow
TFX		TFX
images		images
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

A Guideline to Production Level Deep Learning [Under developement]

1. Data Management

1.1. Data Sources

1.2. Labeling

1.3. Storage

1.4. Versioning

1.5. Processing

2. Development, Training, and Evaluation

2.1. Software engineering

2.2. Resource Management

2.3. DL Frameworks

2.4. Experiment management

2.5. Hyperparameter Tuning

2.6. Distributed Training

3. Troubleshooting [TBD]

4. Testing and Deployment

4.1. Testing and CI/CD

4.2. Web Depolyment

4.4. Monitoring:

4.5. Deploying on Embedded and Mobile Devices

4.6. All-in-one solutions

Other useful links:

Contributing

References:

About

Uh oh!

Releases

Packages

soham96/Production-Level-Deep-Learning

Folders and files

Latest commit

History

Repository files navigation

A Guideline to Production Level Deep Learning [Under developement]

1. Data Management

1.1. Data Sources

1.2. Labeling

1.3. Storage

1.4. Versioning

1.5. Processing

2. Development, Training, and Evaluation

2.1. Software engineering

2.2. Resource Management

2.3. DL Frameworks

2.4. Experiment management

2.5. Hyperparameter Tuning

2.6. Distributed Training

3. Troubleshooting [TBD]

4. Testing and Deployment

4.1. Testing and CI/CD

4.2. Web Depolyment

4.4. Monitoring:

4.5. Deploying on Embedded and Mobile Devices

4.6. All-in-one solutions

Other useful links:

Contributing

References:

About

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages