-
User Interface
-
InferenceSession
-
Read inference_session.h to get familiar with the common interface functions.
The following are the important ones for training (a minimal usage sketch follows this list):
- Load()
- Run()
- NewIOBinding()
- RegisterGraphTransformer()
- RegisterExecutionProvider()
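These C++ entry points back Python's onnxruntime.InferenceSession; a minimal usage sketch (the model path and input name are placeholders):

```python
import numpy as np
import onnxruntime as ort

# Construction covers Load() + Initialize(); the providers argument maps to
# RegisterExecutionProvider().
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Run(): feeds by input name, fetches by output name (None = all outputs).
x = np.random.rand(1, 3).astype(np.float32)
outputs = sess.run(None, {"input": x})
```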
-
Advanced: Initialize()
- What happens under the hood when we call session.Initialize()?
-
Understand the configs in SessionOptions (session_options.h)
The following are the important ones for training (a sketch follows this list):
- execution_order
- enable_mem_pattern
- use_deterministic_compute
- session_log_severity_level
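A sketch of setting these from Python; the attribute names below are how the pybind layer exposes them, so verify against your build:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.execution_order = ort.ExecutionOrder.PRIORITY_BASED  # vs. DEFAULT
so.enable_mem_pattern = True
so.use_deterministic_compute = False
so.log_severity_level = 2  # session_log_severity_level: 0=VERBOSE ... 4=FATAL
sess = ort.InferenceSession("model.onnx", sess_options=so)
```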
-
IOBinding (IOBinding.h) (a Python usage sketch follows this block)
- Prerequisite: What is an OrtValue? (ml_value.h)
-
BindInput()
- How to create an OrtValue?
-
BindOutput()
- With preallocated buffer
-
Without preallocated buffer
- Who allocates the output buffer, and how is it returned to the user?
- What should be the lifespan of an IOBinding? Can the user reuse an IOBinding across multiple Session::Run() calls?
- How are bound inputs/outputs passed into ExecutionFrame?
- Advanced: How is IOBinding different from dlpack's approach? What are their respective advantages?
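A Python-level sketch of the binding flow (names are placeholders):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
binding = sess.io_binding()  # NewIOBinding() under the hood

x = np.random.rand(1, 3).astype(np.float32)
binding.bind_cpu_input("input", x)  # BindInput() from a CPU buffer

# BindOutput() without a preallocated buffer: ORT allocates the buffer
# itself and hands it back through the binding.
binding.bind_output("output")

sess.run_with_iobinding(binding)
result = binding.copy_outputs_to_cpu()[0]
```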
-
ORTModule (ortmodule.py)
- Read training_agent.h to understand the available interface functions
- Understand how ORTModule uses TrainingAgent
-
Read the ORTModule forward() and backward() functions (a minimal usage sketch follows this list)
- How does ORTModule get an ONNX graph from a torch nn.Module?
- How does ORT do auto-diff without torch's autograd engine?
- How does ORT hijack torch's forward/backward calls?
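A minimal wrapping sketch, assuming an onnxruntime-training install (the ORTModule import path has varied across releases):

```python
import torch
from onnxruntime.training.ortmodule import ORTModule

net = ORTModule(torch.nn.Linear(3, 1))  # forward/backward now route through ORT

x = torch.randn(4, 3)
loss = net(x).sum()  # the first call exports the module to ONNX and builds the
                     # training graph; later calls reuse it
loss.backward()      # gradients come from ORT via TrainingAgent; torch autograd
                     # only sees one opaque autograd function
```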
-
Advanced: PyBind
- How is the C++ InferenceSession exposed as Python's onnxruntime.InferenceSession?
- Read onnxruntime_pybind_state.cc, onnxruntime_inference_collection.py for InferenceSession binding
- Read orttraining_pybind_state.cc for TrainingAgent binding
-
Graph
-
Basic building blocks to describe a computation Graph
-
Node (graph.h/.cc)
- What's the difference between an Op and a Node?
-
What are the common properties for a node?
- Can a node's name be empty?
- What's the identifier of a node in a graph? Index or Name?
- Advanced: Function Ops, nodes with a FunctionBody
-
Graph (graph.h/.cc)
- How to traverse from one node to another node?
- What's the difference between a GraphInput and an Initializer?
- Look for an example using GetProducerNode() and GetConsumerNodes() (a Python analogue is sketched after this block)
-
What's the purpose of Graph::Resolve()?
- How is ShapeAndTypeInference invoked?
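The C++ accessors have no direct Python equivalent, but a hedged onnx-level analogue of the producer/consumer lookups (and of the GraphInput vs. Initializer distinction) looks like this:

```python
import onnx

model = onnx.load("model.onnx")  # any model; the path is a placeholder
graph = model.graph

# Rough analogue of GetProducerNode()/GetConsumerNodes(): map each tensor
# name to the node producing it and the nodes consuming it.
producer = {out: node for node in graph.node for out in node.output}
consumers = {}
for node in graph.node:
    for name in node.input:
        consumers.setdefault(name, []).append(node)

# GraphInputs are fed at runtime; Initializers are constants stored in the graph.
graph_inputs = {i.name for i in graph.input}
initializers = {t.name for t in graph.initializer}
print(graph_inputs - initializers)  # the "true" runtime inputs
```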
-
NodeArg (node_arg.h)
- What's the relationship between a graph edge and a NodeArg?
- What's the unique identifier of a NodeArg in a graph?
- Action: Look for some example using Graph::GetOrCreateNodeArg() (You will need to use this at some point)
-
Graph Transformers
-
Must-know Transformers
-
Read one Fusion transformer
- MatMulAddFusion or LayerNormFusion
- or your own pick
-
Read one Elimination RewriteRule
- IdentityElimination
- or your own pick
- CommonSubexpressionElimination
- ConstantFolding
- Understand the difference between GraphTransformer and RewriteRule
-
Understanding the purpose of GraphTransformerManager
- How to register a set of graph transformers into a session?
- Why are registered transformers invoked in a loop?
- What determines the order in which transformers are applied? Does the order matter?
-
Understanding the two versions of graph_transformer_utils.cc (the onnxruntime and orttraining ones)
- When is GeneratePreTrainingTransformers applied?
- How to determine if a transformer should be applied on the inference graph or training graph?
- What do the different graph transformer levels mean?
- Get familiar with graph_utils.cc
- Experiment with onnx.helper to compose an ONNX model from a script (see transpose_matmul_gen.py for examples; a sketch follows this list)
- Action: Implement a graph transformer to get hands-on experience
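For example, a hedged sketch in the same spirit as transpose_matmul_gen.py, building a small MatMul + Add model (a natural candidate for MatMulAddFusion):

```python
import onnx
from onnx import helper, TensorProto

# Declare graph inputs/outputs with explicit types and shapes.
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 3])
W = helper.make_tensor_value_info("W", TensorProto.FLOAT, [3, 4])
B = helper.make_tensor_value_info("B", TensorProto.FLOAT, [4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

matmul = helper.make_node("MatMul", ["X", "W"], ["T"])
add = helper.make_node("Add", ["T", "B"], ["Y"])

graph = helper.make_graph([matmul, add], "matmul_add", [X, W, B], [Y])
model = helper.make_model(graph)
onnx.checker.check_model(model)
onnx.save(model, "matmul_add.onnx")
```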
-
Training Graph
- Understand the workflow of training graph transformation
- Understand GraphAugmenter (graph_augmenter.h/.cc)
-
GradientGraphBuilder
- What is the Sum op used for in the gradient graph?
- Understand the purpose/usage of STOP_GRADIENT_EDGES
- Understand the meaning of x_node_args/y_node_args
- Advanced: Understand the back-propagation process in GradientGraphBuilder::Build()
-
Per Op GradientBuilder
- Understand the Gradient Registry (gradient_builder_registry.cc)
- Understand the Gradient Builder Declaration (gradient_builder.h)
-
Read a few examples in Gradient Builder Implementation (gradient_builder.cc)
- Understand the shorthands of I, GI, O, GO (gradient_builder_base.h)
- Understand how a gradient subgraph is composed from existing ops; the following are good examples:
- Easy: GetDropoutGradient, GetSqrtGradient
- Medium: GetAddSubGradient, GetMulGradient
- Hard: GetMatMulGradient, GetGemmGradient
- Understand how broadcasting is handled when building the gradient graph (GradientBuilderBase::HandleBroadcasting())
- Action: Implement a gradient definition for an op to get hands-on experience
-
Op and Kernels
- Understand the difference between Schema and Kernel
-
ONNX
- Read onnx.proto and onnx-ml.proto and understand the design principles behind them
-
Get familiar with the ONNX Operators: https://github.com/onnx/onnx/blob/master/docs/Operators.md
- Must know: Dropout, MatMul, Gemm, Transpose, ReduceSum, Reshape
-
Understand the concept and purpose of opset, domain
-
When to use which? (see the sketch after this list)
- the default onnx domain
- the com.microsoft domain (msdomain)
- Understand the C++ data structures onnx::TensorProto, onnx::AttributeProto, onnx::TypeProto
- Understand how Shape and Type Inference works in the schema definition
- Function Ops
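A hedged onnx.helper sketch of the domain/opset mechanics (the opset versions are illustrative):

```python
import onnx
from onnx import helper, TensorProto

X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

# A contrib op from the Microsoft domain (schema lives in contrib_defs.cc);
# standard ops use the default "" (onnx) domain.
gelu = helper.make_node("Gelu", ["X"], ["Y"], domain="com.microsoft")
graph = helper.make_graph([gelu], "gelu_ms", [X], [Y])

# A model declares an opset import (domain, version) for every domain it uses.
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 14),
                   helper.make_opsetid("com.microsoft", 1)],
)
onnx.save(model, "gelu_ms.onnx")
```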
-
Op Schema
-
Understand the difference among the following three sets of schemas. When to use which?
- ONNX's op schema (onnx repo: defs.cc)
-
contrib ops (contrib_defs.cc)
- Good to know: LayerNorm, Gelu
- training ops (training_op_defs.cc)
- Action: Add an op or update an op's schema to get hands-on experience
-
Op Kernels
-
Kernel Declaration and Registry
- Understand when to use which registry for a kernel
-
Inference Kernels
- ONNX Op Kernels
- cpu_execution_provider.cc
- cuda_execution_provider.cc
- Contrib Op Kernels
- cpu_contrib_kernels.cc
- cuda_contrib_kernels.cc
- Advanced: rocm_contrib_kernels.cc
-
Training Kernels
- CPU (cpu_training_kernels.cc)
- CUDA (cuda_training_kernels.cc)
- Advanced: ROCm (rocm_training_kernels.cc)
-
Kernel Implementation
-
Tensor vs. OrtValue
- Read tensor.h and ml_value.h
- What's the difference between Tensor and OrtValue? Why do we need two classes?
- How to get a Tensor from an OrtValue?
- How to get the data's raw pointer from a Tensor? (a Python-level sketch follows)
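At the Python level, OrtValue is exposed directly; a sketch of moving between numpy arrays and OrtValues:

```python
import numpy as np
import onnxruntime as ort

arr = np.ones((2, 3), dtype=np.float32)

# An OrtValue wraps a Tensor (or other value types) plus type/device info.
val = ort.OrtValue.ortvalue_from_numpy(arr)                # CPU-backed
# val = ort.OrtValue.ortvalue_from_numpy(arr, "cuda", 0)   # device-backed copy

print(val.is_tensor(), val.shape(), val.data_type())
back = val.numpy()  # back to a numpy array
```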
-
Kernel Definition
- When to use Alias() and VariadicAlias()?
- How to set TypeConstraint()?
- When to use InputMemoryType<OrtMemTypeCPUInput>?
-
CPU Kernel vs. CUDA Kernel
- What does it mean to have a CPU input/output for a CUDA kernel?
-
Gradient Kernels
- Examples
- Easy: DropoutGrad, GeluGrad
- Medium: GatherGrad
- Hard: LayerNormalizationGrad
- Understand how to write unit tests to check gradient's correctness
- Understand how to use OpTester in unit tests
- Action: Implement a kernel to get hands-on experience
-
Performance Investigation
-
Profiling Tools
-
nvprof
- try running with/without --print-gpu-summary
- try --profile-child-processes
- Action: profile a training run
-
Visual Profiler UI
- Use the ruler to measure a time span
- Identify the top hitters in kernels
- Compare two sets of profiling results to identify the performance gap
- Can you identify the start/end of a train_step from the timeline view?
- torch profiler
- Linux perf
- CUDA Kernel Optimization
-
ExecutionProvider
- What is an execution provider? What problems does it solve? (execution_provider.h)
- CPU and CUDA are the most commonly used EPs in training (cpu/cuda_execution_provider.cc)
- How to register an execution provider into a session, or through the ORTModule interface?
- What's the functionality of ExecutionProvider::GetCapability()?
-
Execution Engine
-
InferenceSession::Run()
- Read RunOptions and understand the options (run_options.h)
-
SequentialExecutor::Execute()
-
What's the purpose of ExecutionFrame? (execution_frame.h)
- How is one node's output passed in as another node's input?
- What happens when we call context->Output() inside an op kernel?
- How are feeds and fetches stored in ExecutionFrame?
-
How is the execution order determined? (graph_viewer.cc)
- The default execution order uses Graph::ReverseDFS() to generate a topological sort
- The priority-based execution order uses Graph::KahnsTopologicalSort() with per-node priorities
- How is each node's kernel invoked?
- How does ORT guarantee that all CUDA kernels have completed before Session::Run() returns?
-
Advanced: GraphPartitioner
- How is it determined which execution provider each node is placed on? (graph_partitioner.h)
-
Memory
-
BFCArena
- Why do we need an arena? What problem does it solve?
- Memory Planning
-
Memory Pattern
- How does ORT estimate peak memory consumption?
-
External: Torch's CUDACachingAllocator
- How does ORTModule use the PyTorch allocator? (ortmodule.py)
- Advanced: What's the difference between BFCArena and CUDACachingAllocator?
-
CUDA Programming
-
CUDA programming basics
-
Understand the hardware
-
Architecture Generations
- P100: Pascal / sm60
- V100: Volta / sm70
- A100: Ampere / sm80
- CUDA Core vs. Tensor Core
-
Programming model
- Thread
- Block
- Grid
- Stream
-
Must-know functions (a Python/torch analogue is sketched after this list)
- cudaMalloc() vs. cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() vs. cudaStreamWaitEvent()
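These are C APIs, but torch's Python bindings wrap the same stream/event primitives; a hedged analogue (requires a CUDA build of torch):

```python
import torch

assert torch.cuda.is_available()
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(s1):
    a = torch.randn(1024, 1024, device="cuda")
    b = a @ a          # enqueued on s1
    done.record()      # ~ cudaEventRecord() on the current stream

s2.wait_event(done)    # ~ cudaStreamWaitEvent(): order s2 after s1's work
with torch.cuda.stream(s2):
    c = b * 2

torch.cuda.synchronize()  # ~ cudaDeviceSynchronize()
```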
-
Common tricks
- Avoid memcpy
- Avoid unnecessary Sync
- Preprocess data in CPU
- When to use #pragma unroll?
-
CUDA Kernel Examples
- Easy: Dropout/DropGrad
- Medium: SoftmaxCrossEntropyLoss(Grad)
- Hard: LayerNormalization, ReduceSum, GatherGrad
-
Debugging CUDA kernels
- printf() works inside CUDA code
- Memcpy data to CPU for inspection?
- Understanding IO bound and compute bound
-
Distributed Training
-
Prerequisites: NCCL
- Read https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
-
Good to know: MPI
- Good read: https://mpitutorial.com/tutorials/
-
Data Parallelism
- Understand NCCLAllReduce
- Get familiar with DDP usage/setup (a minimal sketch follows)
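A minimal data-parallel sketch with torch.distributed (assumes one process per GPU on a single node, launched via torchrun --nproc_per_node=N):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL backs the collectives
rank = dist.get_rank()
torch.cuda.set_device(rank)

grad = torch.ones(4, device="cuda") * (rank + 1)
# AllReduce: every rank ends up with the element-wise sum across all ranks.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()  # averaging, as DDP does for gradients
```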
-
Megatron
- Read https://arxiv.org/abs/1909.08053
-
ZeRO
- Read https://arxiv.org/abs/1910.02054
-
ZeRO-1
- Understand ReduceScatter/AllGather
- Understand how optimizer state is partitioned
- ZeRO-2
- ZeRO-3
-
Mixture of Experts
- Understand All2All
- Pipeline Parallelism
-
Know-hows
- Conda
- Docker
-
VS Code
- Setting up VS Code with a remote VM
- Debugging within VS Code
- Debugging with gdb / pdb
-
Common debugging Tricks
- Getting the .onnx inference/training graph (see the DebugOptions sketch after this list)
- Enable I/O Dump
- Enable execution plan and memory plan dump
- Enable CPU profiling dump
- Enable CUDA memory consumption logs
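For the first trick, ORTModule can dump its exported graphs via DebugOptions; a hedged sketch (check ortmodule.py for the exact signature in your build):

```python
import torch
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

net = torch.nn.Linear(3, 1)
# save_onnx dumps the exported/optimized .onnx graphs using the given prefix.
debug = DebugOptions(log_level=LogLevel.VERBOSE, save_onnx=True,
                     onnx_prefix="my_model")
net = ORTModule(net, debug_options=debug)
```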
-
Learning Roadmap
-
Level 0
- InferenceSession/ORTModule
- Graph/Node/NodeArg
- Onnx/Op/Schema/Kernel
- OrtValue/Tensor
- Per-op Gradient Building
- Performance Investigation
-
Level 1
- GraphTransformer
- ExecutionProvider
- IOBinding/dlpack
- PyBind
- Gradient Graph Building
- CUDA Programming
-
Level 2
-
Execution Engine
- SessionState
- ExecutionFrame
- Memory
- Distributed Training
-
Level 3
- Performance optimization for CUDA kernels
-
Model Training Domain Knowledge
-
ML Knowledge
- Understand the meaning and implications of common configurations: batch size, sequence length, learning rate, weight decay, global norm, loss scale...
- Familiarize yourself with the common patterns in a decreasing loss curve; learn to spot abnormal patterns
- Understand the differences between optimizers: SGD, Adam, and LAMB
- Advanced: Understanding Backpropagation https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
-
Know-hows
- Get familiar with running/monitoring AML experiments
- Familiarize yourself with setting up TensorBoard
- Action: Submit a distributed training job to an AML cluster and get familiar with its user interface/logging/available metrics
-
Convergence Investigation
-
Remove all randomness in the program (a sketch follows this list)
- Set Seeds
- Set Dropout Ratio to 0
- Set use_deterministic_compute=True
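A hedged sketch of pinning the RNGs plus the ORT flag (assuming use_deterministic_compute is exposed on SessionOptions in your build):

```python
import random
import numpy as np
import torch
import onnxruntime as ort

# Pin every RNG the run touches.
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

# And ask ORT for deterministic kernels.
so = ort.SessionOptions()
so.use_deterministic_compute = True
```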
-
Shrink the repro conditions to the bare minimum, as long as the issue still reproduces
- Use 1 layer model
- Use smaller hidden_size
- Use single GPU
- ...
-
Common Tricks
- Set the learning rate to 0 to rule out model updates
- Advanced: How to do hyper-parameter tuning to get the model to converge better?
- Action: Train a model E2E to get hands-on experience