-
User Interface
-
InferenceSession
-
Read inference_session.h to get familiar with the common interface functions.
The following are the important ones for training (a minimal usage sketch follows this list):
- Load()
- Run()
- NewIOBinding()
- RegisterGraphTransformer()
- RegisterExecutionProvider()
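These C++ entry points back Python's onnxruntime.InferenceSession; a minimal usage sketch (the model path and input name are placeholders):

```python
import numpy as np
import onnxruntime as ort

# Construction covers Load() + Initialize(); the providers argument maps to
# RegisterExecutionProvider().
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Run(): feeds by input name, fetches by output name (None = all outputs).
x = np.random.rand(1, 3).astype(np.float32)
outputs = sess.run(None, {"input": x})
```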
-
Advanced: Initialize()
- What happens under the hood when we call session.Initialize()?
-
Understand the configs in SessionOptions (session_options.h)
The following are the important ones for training (a sketch follows this list):
- execution_order
- enable_mem_pattern
- use_deterministic_compute
- session_log_severity_level
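A sketch of setting these from Python; the attribute names below are how the pybind layer exposes them, so verify against your build:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.execution_order = ort.ExecutionOrder.PRIORITY_BASED  # vs. DEFAULT
so.enable_mem_pattern = True
so.use_deterministic_compute = False
so.log_severity_level = 2  # session_log_severity_level: 0=VERBOSE ... 4=FATAL
sess = ort.InferenceSession("model.onnx", sess_options=so)
```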
-
IOBinding (IOBinding.h) (a Python usage sketch follows this block)
- Prerequisite: What is an OrtValue? (ml_value.h)
-
BindInput()
- How to create an OrtValue?
-
BindOutput()
- With preallocated buffer
-
Without preallocated buffer
- Who allocates the output buffer, and how is it returned to the user?
- What should be the lifespan of an IOBinding? Can the user reuse an IOBinding across multiple Session::Run() calls?
- How are bound inputs/outputs passed into ExecutionFrame?
- Advanced: How is IOBinding different from dlpack's approach? What are their respective advantages?
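A Python-level sketch of the binding flow (names are placeholders):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
binding = sess.io_binding()  # NewIOBinding() under the hood

x = np.random.rand(1, 3).astype(np.float32)
binding.bind_cpu_input("input", x)  # BindInput() from a CPU buffer

# BindOutput() without a preallocated buffer: ORT allocates the buffer
# itself and hands it back through the binding.
binding.bind_output("output")

sess.run_with_iobinding(binding)
result = binding.copy_outputs_to_cpu()[0]
```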
-
ORTModule (ortmodule.py)
- Read training_agent.h to understand the available interface functions
- Understand how ORTModule uses TrainingAgent
-
Read the ORTModule forward() and backward() functions (a minimal usage sketch follows this list)
- How does ORTModule get an ONNX graph from a torch nn.Module?
- How does ORT do auto-diff without torch's autograd engine?
- How does ORT hijack torch's forward/backward calls?
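A minimal wrapping sketch, assuming an onnxruntime-training install (the ORTModule import path has varied across releases):

```python
import torch
from onnxruntime.training.ortmodule import ORTModule

net = ORTModule(torch.nn.Linear(3, 1))  # forward/backward now route through ORT

x = torch.randn(4, 3)
loss = net(x).sum()  # the first call exports the module to ONNX and builds the
                     # training graph; later calls reuse it
loss.backward()      # gradients come from ORT via TrainingAgent; torch autograd
                     # only sees one opaque autograd function
```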
-
Advanced: PyBind
- How is the C++ InferenceSession exposed as Python's onnxruntime.InferenceSession?
- Read onnxruntime_pybind_state.cc, onnxruntime_inference_collection.py for InferenceSession binding
- Read orttraining_pybind_state.cc for TrainingAgent binding
-
Graph
-
Basic building blocks to describe a computation Graph
-
Node (graph.h/.cc)
- What's the difference between an Op and a Node?
-
What are the common properties for a node?
- Can a node's name be empty?
- What's the identifier of a node in a graph? Index or Name?
- Advanced: Function Ops, nodes with a FunctionBody
-
Graph (graph.h/.cc)
- How to traverse from one node to another node?
- What's the difference between a GraphInput and an Initializer?
- Look for an example using GetProducerNode() and GetConsumerNodes() (a Python analogue is sketched after this block)
-
What's the purpose of Graph::Resolve()?
- How is ShapeAndTypeInference invoked?
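The C++ accessors have no direct Python equivalent, but a hedged onnx-level analogue of the producer/consumer lookups (and of the GraphInput vs. Initializer distinction) looks like this:

```python
import onnx

model = onnx.load("model.onnx")  # any model; the path is a placeholder
graph = model.graph

# Rough analogue of GetProducerNode()/GetConsumerNodes(): map each tensor
# name to the node producing it and the nodes consuming it.
producer = {out: node for node in graph.node for out in node.output}
consumers = {}
for node in graph.node:
    for name in node.input:
        consumers.setdefault(name, []).append(node)

# GraphInputs are fed at runtime; Initializers are constants stored in the graph.
graph_inputs = {i.name for i in graph.input}
initializers = {t.name for t in graph.initializer}
print(graph_inputs - initializers)  # the "true" runtime inputs
```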
-
NodeArg (node_arg.h)
- What's the relationship between a graph edge and a NodeArg?
- What's the unique identifier of a NodeArg in a graph?
- Action: Look for some example using Graph::GetOrCreateNodeArg() (You will need to use this at some point)
-
Graph Transformers
-
Must-know Transformers
-
Read one Fusion transformer
- MatMulAddFusion or LayerNormFusion
- or your own pick
-
Read one Elimination RewriteRule
- IdentityElimination
- or your own pick
- CommonSubexpressionElimination
- ConstantFolding
- Understand the difference between GraphTransformer and RewriteRule
-
Understanding the purpose of GraphTransformerManager
- How to register a set of graph transformers into a session?
- Why are registered transformers invoked in a loop?
- What determines the order in which transformers are applied? Does the order matter?
-
Understanding the two versions of graph_transformer_utils.cc (the onnxruntime and orttraining ones)
- When is GeneratePreTrainingTransformers applied?
- How to determine if a transformer should be applied on the inference graph or training graph?
- What do the different graph transformer levels mean?
- Get familiar with graph_utils.cc
- Experiment with onnx.helper to compose an ONNX model from a script (see transpose_matmul_gen.py for examples; a sketch follows this list)
- Action: Implement a graph transformer to get hands-on experience
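For example, a hedged sketch in the same spirit as transpose_matmul_gen.py, building a small MatMul + Add model (a natural candidate for MatMulAddFusion):

```python
import onnx
from onnx import helper, TensorProto

# Declare graph inputs/outputs with explicit types and shapes.
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 3])
W = helper.make_tensor_value_info("W", TensorProto.FLOAT, [3, 4])
B = helper.make_tensor_value_info("B", TensorProto.FLOAT, [4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

matmul = helper.make_node("MatMul", ["X", "W"], ["T"])
add = helper.make_node("Add", ["T", "B"], ["Y"])

graph = helper.make_graph([matmul, add], "matmul_add", [X, W, B], [Y])
model = helper.make_model(graph)
onnx.checker.check_model(model)
onnx.save(model, "matmul_add.onnx")
```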
-
Training Graph
- Understand the workflow of training graph transformation
- Understand GraphAugmenter (graph_augmenter.h/.cc)
-
GradientGraphBuilder
- What is the Sum op used for in the gradient graph?
- Understand the purpose/usage of STOP_GRADIENT_EDGES
- Understand the meaning of x_node_args/y_node_args
- Advanced: Understand the back-propagation process in GradientGraphBuilder::Build()
-
Per Op GradientBuilder
- Understand the Gradient Registry (gradient_builder_registry.cc)
- Understand the Gradient Builder Declaration (gradient_builder.h)
-
Read a few examples in Gradient Builder Implementation (gradient_builder.cc)
- Understand the shorthands of I, GI, O, GO (gradient_builder_base.h)
- Understand how a gradient subgraph is composed from existing ops; the following are good examples:
- Easy: GetDropoutGradient, GetSqrtGradient
- Medium: GetAddSubGradient, GetMulGradient
- Hard: GetMatMulGradient, GetGemmGradient
- Understand how broadcasting is handled when building the gradient graph (GradientBuilderBase::HandleBroadcasting())
- Action: Implement a gradient definition for an op to get hands-on experience
-
Op and Kernels
- Understand the difference between Schema and Kernel
-
ONNX
- Read onnx.proto and onnx-ml.proto and understand the design principles behind them
-
Get familiar with the ONNX Operators: https://github.com/onnx/onnx/blob/master/docs/Operators.md
- Must know: Dropout, MatMul, Gemm, Transpose, ReduceSum, Reshape
-
Understand the concept and purpose of opset, domain
-
When to use which? (see the sketch after this list)
- the default onnx domain
- the com.microsoft domain (msdomain)
- Understand the C++ data structures onnx::TensorProto, onnx::AttributeProto, onnx::TypeProto
- Understand how Shape and Type Inference works in the schema definition
- Function Ops
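A hedged onnx.helper sketch of the domain/opset mechanics (the opset versions are illustrative):

```python
import onnx
from onnx import helper, TensorProto

X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

# A contrib op from the Microsoft domain (schema lives in contrib_defs.cc);
# standard ops use the default "" (onnx) domain.
gelu = helper.make_node("Gelu", ["X"], ["Y"], domain="com.microsoft")
graph = helper.make_graph([gelu], "gelu_ms", [X], [Y])

# A model declares an opset import (domain, version) for every domain it uses.
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 14),
                   helper.make_opsetid("com.microsoft", 1)],
)
onnx.save(model, "gelu_ms.onnx")
```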
-
Op Schema
-
Understand the difference among the following three sets of schemas. When to use which?
- ONNX's op schema (onnx repo: defs.cc)
-
contrib ops (contrib_defs.cc)
- Good to know: LayerNorm, Gelu
- training ops (training_op_defs.cc)
- Action: Add an op or update an op's schema to get hands-on experience
-
Op Kernels
-
Kernel Declaration and Registry
- Understand when to use which registry for a kernel
-
Inference Kernels
- ONNX Op Kernels
- cpu_execution_provider.cc
- cuda_execution_provider.cc
- Contrib Op Kernels
- cpu_contrib_kernels.cc
- cuda_contrib_kernels.cc
- Advanced: rocm_contrib_kernels.cc
-
Training Kernels
- CPU (cpu_training_kernels.cc)
- CUDA (cuda_training_kernels.cc)
- Advanced: ROCm (rocm_training_kernels.cc)
-
Kernel Implementation
-
Tensor vs. OrtValue
- Read tensor.h and ml_value.h
- What's the difference between Tensor and OrtValue? Why do we need two classes?
- How to get a Tensor from an OrtValue?
- How to get the data's raw pointer from a Tensor? (a Python-level sketch follows)
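At the Python level, OrtValue is exposed directly; a sketch of moving between numpy arrays and OrtValues:

```python
import numpy as np
import onnxruntime as ort

arr = np.ones((2, 3), dtype=np.float32)

# An OrtValue wraps a Tensor (or other value types) plus type/device info.
val = ort.OrtValue.ortvalue_from_numpy(arr)                # CPU-backed
# val = ort.OrtValue.ortvalue_from_numpy(arr, "cuda", 0)   # device-backed copy

print(val.is_tensor(), val.shape(), val.data_type())
back = val.numpy()  # back to a numpy array
```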
-
Kernel Definition
- When to use Alias() and VariadicAlias()?
- How to set TypeConstraint()?
- When to use InputMemoryType<OrtMemTypeCPUInput>?
-
CPU Kernel vs. CUDA Kernel
- What does it mean to have a CPU input/output for a CUDA kernel?
-
Gradient Kernels
- Examples
- Easy: DropoutGrad, GeluGrad
- Medium: GatherGrad
- Hard: LayerNormalizationGrad
- Understand how to write unit tests to check gradient's correctness
- Understand how to use OpTester in unit tests
- Action: Implement a kernel to get hands-on experience
-
Performance Investigation
-
Profiling Tools
-
nvprof
- try running with/without --print-gpu-summary
- try --profile-child-processes
- Action: profile a training run
-
Visual Profiler UI
- Use the ruler to measure a time span
- Identify the top hitters in kernels
- Compare two sets of profiling results to identify the performance gap
- Can you identify the start/end of a train_step from the timeline view?
- torch profiler
- Linux perf
- CUDA Kernel Optimization
-
ExecutionProvider
- What is an execution provider? What problems does it solve? (execution_provider.h)
- CPU and CUDA are the most commonly used EPs in training (cpu/cuda_execution_provider.cc)
- How to register an execution provider into a session, or through the ORTModule interface?
- What's the functionality of ExecutionProvider::GetCapability()?
-
Execution Engine
-
InferenceSession::Run()
- Read RunOptions and understand the options (run_options.h)
-
SequentialExecutor::Execute()
-
What's the purpose of ExecutionFrame? (execution_frame.h)
- How is one node's output passed in as another node's input?
- What happens when we call context->Output() inside an op kernel?
- How are feeds and fetches stored in ExecutionFrame?
-
How is the execution order determined? (graph_viewer.cc)
- The default execution order uses Graph::ReverseDFS() to generate a topological sort
- The priority-based execution order uses Graph::KahnsTopologicalSort() with per-node priorities
- How is each node's kernel invoked?
- How does ORT guarantee that all CUDA kernels have completed before Session::Run() returns?
-
Advanced: GraphPartitioner
- How is it determined which execution provider each node is placed on? (graph_partitioner.h)
-
Memory
-
BFCArena
- Why do we need an arena? What problem does it solve?
- Memory Planning
-
Memory Pattern
- How does ORT estimate peak memory consumption?
-
External: Torch's CUDACachingAllocator
- How does ORTModule use the PyTorch allocator? (ortmodule.py)
- Advanced: What's the difference between BFCArena and CUDACachingAllocator?
-
CUDA Programming
-
CUDA programming basics
-
Understand the hardware
-
Architecture Generations
- P100: Pascal / sm60
- V100: Volta / sm70
- A100: Ampere / sm80
- CUDA Core vs. Tensor Core
-
Programming model
- Thread
- Block
- Grid
- Stream
-
Must-know functions (a Python/torch analogue is sketched after this list)
- cudaMalloc() vs. cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() vs. cudaStreamWaitEvent()
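These are C APIs, but torch's Python bindings wrap the same stream/event primitives; a hedged analogue (requires a CUDA build of torch):

```python
import torch

assert torch.cuda.is_available()
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(s1):
    a = torch.randn(1024, 1024, device="cuda")
    b = a @ a          # enqueued on s1
    done.record()      # ~ cudaEventRecord() on the current stream

s2.wait_event(done)    # ~ cudaStreamWaitEvent(): order s2 after s1's work
with torch.cuda.stream(s2):
    c = b * 2

torch.cuda.synchronize()  # ~ cudaDeviceSynchronize()
```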
-
Common tricks
- Avoid memcpy
- Avoid unnecessary Sync
- Preprocess data in CPU
- When to use #pragma unroll?
-
CUDA Kernel Examples
- Easy: Dropout/DropGrad
- Medium: SoftmaxCrossEntropyLoss(Grad)
- Hard: LayerNormalization, ReduceSum, GatherGrad
-
Debugging CUDA kernels
- printf() works inside CUDA code
- Memcpy data to CPU for inspection?
- Understanding IO bound and compute bound
-
Distributed Training
-
Prerequisites: NCCL
- Read https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
-
Good to know: MPI
- Good read: https://mpitutorial.com/tutorials/
-
Data Parallelism
- Understand NCCLAllReduce
- Get familiar with DDP usage/setup (a minimal sketch follows)
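A minimal data-parallel sketch with torch.distributed (assumes one process per GPU on a single node, launched via torchrun --nproc_per_node=N):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL backs the collectives
rank = dist.get_rank()
torch.cuda.set_device(rank)

grad = torch.ones(4, device="cuda") * (rank + 1)
# AllReduce: every rank ends up with the element-wise sum across all ranks.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()  # averaging, as DDP does for gradients
```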
-
Megatron
- Read https://arxiv.org/abs/1909.08053
-
ZeRO
- Read https://arxiv.org/abs/1910.02054
-
ZeRO-1
- Understand ReduceScatter/AllGather
- Understand how optimizer state is partitioned
- ZeRO-2
- ZeRO-3
-
Mixture of Experts
- Understand All2All
- Pipeline Parallelism
-
Know-hows
- Conda
- Docker
-
VS Code
- Setting up VS Code with a remote VM
- Debugging within VS Code
- Debugging with gdb / pdb
-
Common debugging Tricks
- Getting the .onnx inference/training graph (see the DebugOptions sketch after this list)
- Enable I/O Dump
- Enable execution plan and memory plan dump
- Enable CPU profiling dump
- Enable CUDA memory consumption logs
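For the first trick, ORTModule can dump its exported graphs via DebugOptions; a hedged sketch (check ortmodule.py for the exact signature in your build):

```python
import torch
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

net = torch.nn.Linear(3, 1)
# save_onnx dumps the exported/optimized .onnx graphs using the given prefix.
debug = DebugOptions(log_level=LogLevel.VERBOSE, save_onnx=True,
                     onnx_prefix="my_model")
net = ORTModule(net, debug_options=debug)
```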
-
Learning Roadmap
-
Level 0
- InferenceSession/ORTModule
- Graph/Node/NodeArg
- Onnx/Op/Schema/Kernel
- OrtValue/Tensor
- Per-op Gradient Building
- Performance Investigation
-
Level 1
- GraphTransformer
- ExecutionProvider
- IOBinding/dlpack
- PyBind
- Gradient Graph Building
- CUDA Programming
-
Level 2
-
Execution Engine
- SessionState
- ExecutionFrame
- Memory
- Distributed Training
-
Level 3
- Performance optimization for CUDA kernels
-
Model Training Domain Knowledge
-
ML Knowledge
- Understand the meaning and implications of common configurations: batch size, sequence length, learning rate, weight decay, global norm, loss scale...
- Familiarize yourself with the common patterns in a decreasing loss curve; learn to spot abnormal patterns
- Understand the differences between optimizers: SGD, Adam, and LAMB
- Advanced: Understanding Backpropagation https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
-
Know-hows
- Get familiar with running/monitoring AML experiments
- Familiarize yourself with setting up TensorBoard
- Action: Submit a distributed training job to an AML cluster and get familiar with its user interface/logging/available metrics
-
Convergence Investigation
-
Remove all randomness in the program (a sketch follows this list)
- Set Seeds
- Set Dropout Ratio to 0
- Set use_deterministic_compute=True
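A hedged sketch of pinning the RNGs plus the ORT flag (assuming use_deterministic_compute is exposed on SessionOptions in your build):

```python
import random
import numpy as np
import torch
import onnxruntime as ort

# Pin every RNG the run touches.
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

# And ask ORT for deterministic kernels.
so = ort.SessionOptions()
so.use_deterministic_compute = True
```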
-
Shrink the repro conditions to the bare minimum, as long as the issue still reproduces
- Use 1 layer model
- Use smaller hidden_size
- Use single GPU
- ...
-
Common Tricks
- Set the learning rate to 0 to rule out model updates
- Advanced: How to do hyper-parameter tuning to get the model to converge better?
- Action: Train a model E2E to get hands-on experience