@@ -13,7 +13,7 @@ programs, and can aid you in debugging.
How autograd encodes the history
--------------------------------
- Autograd is reverse automatic differentiation system. Conceptually,
+ Autograd is a reverse automatic differentiation system. Conceptually,
autograd records a graph of all of the operations that created
the data as you execute operations, giving you a directed acyclic graph
whose leaves are the input tensors and roots are the output tensors.
@@ -23,11 +23,11 @@ compute the gradients using the chain rule.
Internally, autograd represents this graph as a graph of
:class: `Function ` objects (really expressions), which can be
:meth: `~torch.autograd.Function.apply ` ed to compute the result of
- evaluating the graph. When computing the forwards pass, autograd
+ evaluating the graph. When computing the forward pass, autograd
simultaneously performs the requested computations and builds up a graph
representing the function that computes the gradient (the ``.grad_fn ``
attribute of each :class: `torch.Tensor ` is an entry point into this graph).
- When the forwards pass is completed, we evaluate this graph in the
+ When the forward pass is completed, we evaluate this graph in the
backwards pass to compute the gradients.
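
For illustration, a minimal sketch (the tensor names are just examples) of how the
recorded graph is exposed through ``.grad_fn``::

    import torch

    x = torch.randn(3, requires_grad=True)    # leaf tensor
    y = (x * 2).sum()                          # the forward pass records the graph

    print(y.grad_fn)                  # <SumBackward0 ...>: entry point into the graph
    print(y.grad_fn.next_functions)   # edges pointing towards the leaves

    y.backward()                      # evaluate the recorded graph in the backward pass
    print(x.grad)                     # tensor([2., 2., 2.]) accumulated on the leaf
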
An important thing to note is that the graph is recreated from scratch at every
@@ -119,7 +119,7 @@ For more fine-grained exclusion of subgraphs from gradient computation,
you can set the ``requires_grad `` field of a tensor.
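
For example, a minimal sketch of freezing a single parameter this way (the module is
just a stand-in)::

    import torch
    from torch import nn

    model = nn.Linear(4, 2)
    model.weight.requires_grad_(False)   # exclude this parameter's subgraph

    out = model(torch.randn(1, 4)).sum()
    out.backward()
    print(model.weight.grad)             # None: no gradient computed for the frozen weight
    print(model.bias.grad)               # the bias still receives a gradient
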
Below, in addition to discussing the mechanisms above, we also describe
- evaluation mode (:meth: `nn.Module.eval() `), a method that is not actually used
+ evaluation mode (:meth: `nn.Module.eval() `), a method that is not used
to disable gradient computation but, because of its name, is often mixed up with the three.
Setting ``requires_grad ``
@@ -164,16 +164,16 @@ of the module's parameters (which have ``requires_grad=True`` by default).
Grad Modes
^^^^^^^^^^
- Apart from setting ``requires_grad `` there are also three possible modes
- enableable from Python that can affect how computations in PyTorch are
+ Apart from setting ``requires_grad ``, there are also three grad modes that can
+ be selected from Python and that affect how computations in PyTorch are
processed by autograd internally: default mode (grad mode), no-grad mode,
and inference mode, all of which can be toggled via context managers and
decorators.
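
For instance, a minimal sketch of toggling these modes (no-grad mode as a context
manager, inference mode as a decorator)::

    import torch

    x = torch.randn(3, requires_grad=True)

    with torch.no_grad():              # context manager
        y = x * 2
    print(y.requires_grad)             # False: nothing was recorded

    @torch.inference_mode()            # decorator
    def evaluate(t):
        return t * 2

    print(evaluate(x).requires_grad)   # False
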
Default Mode (Grad Mode)
^^^^^^^^^^^^^^^^^^^^^^^^
- The "default mode" is actually the mode we are implicitly in when no other modes like
+ The "default mode" is the mode we are implicitly in when no other modes like
no-grad and inference mode are enabled. To be contrasted with
"no-grad mode" the default mode is also sometimes called "grad mode".
@@ -237,7 +237,7 @@ For implementation details of inference mode see
Evaluation Mode (``nn.Module.eval() ``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Evaluation mode is not actually a mechanism to locally disable gradient computation.
+ Evaluation mode is not a mechanism to locally disable gradient computation.
It is included here anyway because it is sometimes mistaken for such a mechanism.
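
A small sketch of the distinction (the module is just a stand-in)::

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(4, 2), nn.Dropout(p=0.5))
    model.eval()                       # only changes the behaviour of Dropout/BatchNorm

    out = model(torch.randn(1, 4))
    print(out.requires_grad)           # True: autograd is still recording

    with torch.no_grad():              # this is what actually disables gradient tracking
        out = model(torch.randn(1, 4))
    print(out.requires_grad)           # False
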
Functionally, ``module.eval() `` (or equivalently ``module.train(False) ``) is completely
@@ -263,21 +263,21 @@ In-place operations with autograd
Supporting in-place operations in autograd is a hard matter, and we discourage
their use in most cases. Autograd's aggressive buffer freeing and reuse makes
it very efficient and there are very few occasions when in-place operations
- actually lower memory usage by any significant amount. Unless you're operating
+ lower memory usage by any significant amount. Unless you're operating
under heavy memory pressure, you might never need to use them.
There are two main reasons that limit the applicability of in-place operations:
1. In-place operations can potentially overwrite values required to compute
gradients (a short sketch of this failure mode follows this list).
- 2. Every in-place operation actually requires the implementation to rewrite the
+ 2. Every in-place operation requires the implementation to rewrite the
computational graph. Out-of-place versions simply allocate new objects and
keep references to the old graph, while in-place operations require
changing the creator of all inputs to the :class: `Function ` representing
this operation. This can be tricky, especially if there are many Tensors
that reference the same storage (e.g. created by indexing or transposing),
- and in-place functions will actually raise an error if the storage of
+ and in-place functions will raise an error if the storage of
modified inputs is referenced by any other :class: `Tensor `.
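
As a minimal sketch of the first reason above (the exact error message may differ
between releases)::

    import torch

    a = torch.randn(3, requires_grad=True)
    b = a.exp()           # exp() saves its output for use in the backward pass
    b.mul_(2)             # the in-place op overwrites that saved value

    # RuntimeError: one of the variables needed for gradient computation
    # has been modified by an inplace operation
    b.sum().backward()
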
In-place correctness checks
@@ -338,18 +338,18 @@ serializing all the backward calls in a specific order during execution
Non-determinism
^^^^^^^^^^^^^^^
- If you are calling ``backward() `` on multiple thread concurrently but with
- shared inputs (i.e. Hogwild CPU training). Since parameters are automatically
- shared across threads, gradient accumulation might become non-deterministic on
- backward calls across threads, because two backward calls might access and try
- to accumulate the same `` .grad `` attribute . This is technically not safe, and
- it might result in racing condition and the result might be invalid to use.
+ If you are calling ``backward() `` from multiple threads concurrently and have
+ shared inputs (i.e. Hogwild CPU training), then non-determinism should be expected.
+ This can occur because parameters are automatically shared across threads;
+ as such, multiple threads may access and try to accumulate the same `` .grad ``
+ attribute during gradient accumulation. This is technically not safe, and
+ it might result in a race condition and the resulting gradients might be invalid.
- But this is expected pattern if you are using the multithreading approach to
- drive the whole training process but using shared parameters, user who use
- multithreading should have the threading model in mind and should expect this
- to happen. User could use the functional API :func: `torch.autograd.grad ` to
- calculate the gradients instead of ``backward() `` to avoid non-determinism.
+ Users developing multithreaded models featuring shared parameters should have the
+ threading model in mind and should understand the issues described above.
+
+ The functional API :func: `torch.autograd.grad ` may be used to calculate the
+ gradients instead of ``backward() `` to avoid non-determinism.
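
A minimal sketch of that alternative::

    import torch

    x = torch.randn(5, requires_grad=True)
    loss = (x * x).sum()

    # backward() would accumulate into the shared x.grad attribute;
    # torch.autograd.grad() returns the gradients instead of writing to .grad.
    (grad_x,) = torch.autograd.grad(loss, x)
    print(grad_x)      # 2 * x
    print(x.grad)      # None: nothing was accumulated
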
Graph retaining
^^^^^^^^^^^^^^^
@@ -368,9 +368,9 @@ Thread Safety on Autograd Node
Since Autograd allows the caller thread to drive its backward execution for
potential parallelism, it's important that we ensure thread safety on CPU with
- parallel backwards that share part/whole of the GraphTask.
+ parallel `` backward() `` calls that share part/whole of the GraphTask.
- Custom Python ``autograd.Function `` is automatically thread safe because of GIL.
+ Custom Python ``autograd.Function ``\s are automatically thread safe because of the GIL.
For built-in C++ Autograd Nodes (e.g. AccumulateGrad, CopySlices) and custom
``autograd::Function ``\s , the Autograd Engine uses thread mutex locking to ensure
thread safety on autograd Nodes that might have state writes/reads.
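
For reference, a minimal custom Python ``autograd.Function`` of the kind referred to
above (a sketch, independent of any particular threading setup)::

    import torch

    class Square(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x * x

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            return 2 * x * grad_output

    x = torch.randn(3, requires_grad=True)
    Square.apply(x).sum().backward()   # the GIL serializes access to the Python state
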
@@ -440,8 +440,8 @@ It also turns out that no interesting real-valued objective fulfill the
Cauchy-Riemann equations. So the theory of holomorphic functions cannot be
used for optimization and most people therefore use the Wirtinger calculus.
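
As a small worked example (added here for illustration): take the real-valued
objective :math:`f(z) = |z|^2 = z\bar{z}` and write :math:`z = x + iy`, so that
:math:`u = x^2 + y^2` and :math:`v = 0`. The Cauchy-Riemann equations
:math:`u_x = v_y` and :math:`u_y = -v_x` require :math:`2x = 0` and :math:`2y = 0`,
which fails everywhere except :math:`z = 0`, so :math:`f` is not holomorphic; its
Wirtinger derivatives, however, are perfectly well defined:

.. math::

    \frac{\partial f}{\partial z} = \bar{z}, \qquad
    \frac{\partial f}{\partial \bar{z}} = z
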
- Wirtinger Calculus comes in picture ...
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Wirtinger Calculus comes into the picture ...
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
So, we have this great theory of complex differentiability and
holomorphic functions, and we can’t use any of it at all, because many