Commit 44dac51

ArekSredzki authored and pytorchmergebot committed
Improve Autograd Documentation Clarity (#89401)
This makes minor adjustments to the autograd docs, improving clarity and resolving grammatical errors.

Pull Request resolved: #89401
Approved by: https://github.com/kit1980
1 parent 49ccc41 commit 44dac51

1 file changed: +26 −26 lines changed

docs/source/notes/autograd.rst (+26 −26)

@@ -13,7 +13,7 @@ programs, and can aid you in debugging.
 How autograd encodes the history
 --------------------------------
 
-Autograd is reverse automatic differentiation system. Conceptually,
+Autograd is a reverse automatic differentiation system. Conceptually,
 autograd records a graph recording all of the operations that created
 the data as you execute operations, giving you a directed acyclic graph
 whose leaves are the input tensors and roots are the output tensors.
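
As a quick illustration of the graph recording this hunk describes, here is a
minimal sketch using standard ``torch`` APIs (the tensor names are invented for
the example)::

    import torch

    # Leaves of the graph: tensors created directly by the user.
    a = torch.randn(3, requires_grad=True)
    b = torch.randn(3, requires_grad=True)

    # Executing operations records them; each result points at its grad_fn.
    c = a * b          # c.grad_fn is e.g. a MulBackward0 node
    loss = c.sum()     # loss is the root of this graph

    # Evaluating the recorded graph in the backward pass fills in leaf grads.
    loss.backward()
    print(a.grad)      # equals b, the derivative of (a * b).sum() w.r.t. a
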
@@ -23,11 +23,11 @@ compute the gradients using the chain rule.
 Internally, autograd represents this graph as a graph of
 :class:`Function` objects (really expressions), which can be
 :meth:`~torch.autograd.Function.apply` ed to compute the result of
-evaluating the graph. When computing the forwards pass, autograd
+evaluating the graph. When computing the forward pass, autograd
 simultaneously performs the requested computations and builds up a graph
 representing the function that computes the gradient (the ``.grad_fn``
 attribute of each :class:`torch.Tensor` is an entry point into this graph).
-When the forwards pass is completed, we evaluate this graph in the
+When the forward pass is completed, we evaluate this graph in the
 backwards pass to compute the gradients.
 
 An important thing to note is that the graph is recreated from scratch at every
@@ -119,7 +119,7 @@ For more fine-grained exclusion of subgraphs from gradient computation,
 there is setting the ``requires_grad`` field of a tensor.
 
 Below, in addition to discussing the mechanisms above, we also describe
-evaluation mode (:meth:`nn.Module.eval()`), a method that is not actually used
+evaluation mode (:meth:`nn.Module.eval()`), a method that is not used
 to disable gradient computation but, because of its name, is often mixed up with the three.
 
 Setting ``requires_grad``
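
For context, a minimal sketch of the ``requires_grad`` mechanism this hunk
refers to (assuming standard ``torch`` usage; freezing a module's parameters is
just one common application)::

    import torch
    from torch import nn

    model = nn.Linear(4, 2)

    # Freeze the parameters: autograd will not record operations on them,
    # so no gradients are computed or accumulated for these tensors.
    for p in model.parameters():
        p.requires_grad_(False)

    x = torch.randn(8, 4)
    out = model(x)
    print(out.requires_grad)  # False: no input to the op required grad
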
@@ -164,16 +164,16 @@ of the module's parameters (which have ``requires_grad=True`` by default).
 Grad Modes
 ^^^^^^^^^^
 
-Apart from setting ``requires_grad`` there are also three possible modes
-enableable from Python that can affect how computations in PyTorch are
+Apart from setting ``requires_grad`` there are also three grad modes that can
+be selected from Python, which affect how computations in PyTorch are
 processed by autograd internally: default mode (grad mode), no-grad mode,
 and inference mode, all of which can be toggled via context managers and
 decorators.
 
 Default Mode (Grad Mode)
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
-The "default mode" is actually the mode we are implicitly in when no other modes like
+The "default mode" is the mode we are implicitly in when no other modes like
 no-grad and inference mode are enabled. To be contrasted with
 "no-grad mode", the default mode is also sometimes called "grad mode".
 
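
To make the three modes concrete, a minimal sketch of toggling them via context
managers and decorators (assuming standard ``torch`` APIs)::

    import torch

    x = torch.randn(3, requires_grad=True)

    # Default mode (grad mode): operations are recorded for backward.
    y = x * 2
    print(y.requires_grad)  # True

    # No-grad mode: recording is disabled inside the context manager.
    with torch.no_grad():
        y = x * 2
    print(y.requires_grad)  # False

    # Inference mode: a stricter mode intended for pure inference workloads.
    with torch.inference_mode():
        y = x * 2
    print(y.requires_grad)  # False

    # The same modes are also usable as decorators.
    @torch.no_grad()
    def evaluate(inp):
        return inp * 2
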
@@ -237,7 +237,7 @@ For implementation details of inference mode see
 Evaluation Mode (``nn.Module.eval()``)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Evaluation mode is not actually a mechanism to locally disable gradient computation.
+Evaluation mode is not a mechanism to locally disable gradient computation.
 It is included here anyway because it is sometimes mistaken for such a mechanism.
 
 Functionally, ``module.eval()`` (or equivalently ``module.train(False)``) are completely
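
For contrast with the grad modes above, a minimal sketch of what
``module.eval()`` actually toggles (using a dropout layer, one of the standard
modules whose behavior differs between training and evaluation)::

    import torch
    from torch import nn

    drop = nn.Dropout(p=0.5)
    x = torch.randn(4, 4, requires_grad=True)

    drop.train()           # training behavior: elements are randomly zeroed
    out_train = drop(x)

    drop.eval()            # evaluation behavior: dropout becomes a no-op
    out_eval = drop(x)

    # Gradient recording is unaffected by eval(): the output still requires grad.
    print(out_eval.requires_grad)  # True
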
@@ -263,21 +263,21 @@ In-place operations with autograd
 Supporting in-place operations in autograd is a hard matter, and we discourage
 their use in most cases. Autograd's aggressive buffer freeing and reuse makes
 it very efficient and there are very few occasions when in-place operations
-actually lower memory usage by any significant amount. Unless you're operating
+lower memory usage by any significant amount. Unless you're operating
 under heavy memory pressure, you might never need to use them.
 
 There are two main reasons that limit the applicability of in-place operations:
 
 1. In-place operations can potentially overwrite values required to compute
    gradients.
 
-2. Every in-place operation actually requires the implementation to rewrite the
+2. Every in-place operation requires the implementation to rewrite the
    computational graph. Out-of-place versions simply allocate new objects and
    keep references to the old graph, while in-place operations require
    changing the creator of all inputs to the :class:`Function` representing
    this operation. This can be tricky, especially if there are many Tensors
    that reference the same storage (e.g. created by indexing or transposing),
-   and in-place functions will actually raise an error if the storage of
+   and in-place functions will raise an error if the storage of
    modified inputs is referenced by any other :class:`Tensor`.
 
 In-place correctness checks
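
A minimal sketch of the kind of problem described above, where an in-place
operation touches a value that autograd saved for the backward pass (standard
``torch`` operations; the exact error text may differ between versions)::

    import torch

    x = torch.randn(3, requires_grad=True)
    y = x.exp()      # exp saves its output to compute its backward
    y.add_(1)        # in-place modification of a value needed for gradients

    # backward() detects the modification via version counters and raises
    # an error instead of silently returning wrong gradients.
    try:
        y.sum().backward()
    except RuntimeError as err:
        print(err)
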
@@ -338,18 +338,18 @@ serializing all the backward calls in a specific order during execution
 Non-determinism
 ^^^^^^^^^^^^^^^
 
-If you are calling ``backward()`` on multiple thread concurrently but with
-shared inputs (i.e. Hogwild CPU training). Since parameters are automatically
-shared across threads, gradient accumulation might become non-deterministic on
-backward calls across threads, because two backward calls might access and try
-to accumulate the same ``.grad`` attribute. This is technically not safe, and
-it might result in racing condition and the result might be invalid to use.
+If you are calling ``backward()`` from multiple threads concurrently and have
+shared inputs (i.e. Hogwild CPU training), then non-determinism should be expected.
+This can occur because parameters are automatically shared across threads;
+as such, multiple threads may access and try to accumulate the same ``.grad``
+attribute during gradient accumulation. This is technically not safe, and
+it might result in a race condition and the result might be invalid to use.
 
-But this is expected pattern if you are using the multithreading approach to
-drive the whole training process but using shared parameters, user who use
-multithreading should have the threading model in mind and should expect this
-to happen. User could use the functional API :func:`torch.autograd.grad` to
-calculate the gradients instead of ``backward()`` to avoid non-determinism.
+Users developing multithreaded models featuring shared parameters should have the
+threading model in mind and should understand the issues described above.
+
+The functional API :func:`torch.autograd.grad` may be used to calculate the
+gradients instead of ``backward()`` to avoid non-determinism.
 
 Graph retaining
 ^^^^^^^^^^^^^^^
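
As an illustration of the ``torch.autograd.grad`` alternative mentioned in the
hunk above (a single-threaded sketch; in a real Hogwild-style setup each worker
thread would run something like this against shared parameters)::

    import torch

    w = torch.randn(3, requires_grad=True)   # shared parameter
    x = torch.randn(3)                       # per-thread input
    loss = (w * x).sum()

    # backward() would accumulate into the shared w.grad, which can race
    # across threads; torch.autograd.grad() returns the gradients instead.
    (grad_w,) = torch.autograd.grad(loss, (w,))
    print(grad_w)  # equals x
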
@@ -368,9 +368,9 @@ Thread Safety on Autograd Node
 
 Since Autograd allows the caller thread to drive its backward execution for
 potential parallelism, it's important that we ensure thread safety on CPU with
-parallel backwards that share part/whole of the GraphTask.
+parallel ``backward()`` calls that share part/whole of the GraphTask.
 
-Custom Python ``autograd.Function`` is automatically thread safe because of GIL.
+Custom Python ``autograd.Function``\s are automatically thread safe because of the GIL.
 For built-in C++ Autograd Nodes (e.g. AccumulateGrad, CopySlices) and custom
 ``autograd::Function``\s, the Autograd Engine uses thread mutex locking to ensure
 thread safety on autograd Nodes that might have state write/read.
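
For reference, a minimal example of the kind of custom Python
``autograd.Function`` the paragraph refers to (a sketch; real Functions may
keep more state, which is where the thread-safety considerations come in)::

    import torch

    class Square(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)   # per-call state lives on ctx
            return x * x

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            return 2 * x * grad_output

    x = torch.randn(4, requires_grad=True)
    Square.apply(x).sum().backward()
    print(x.grad)  # equals 2 * x
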
@@ -440,8 +440,8 @@ It also turns out that no interesting real-valued objective fulfills the
 Cauchy-Riemann equations. So the theory of holomorphic functions cannot be
 used for optimization and most people therefore use the Wirtinger calculus.
 
-Wirtinger Calculus comes in picture ...
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Wirtinger Calculus comes into the picture ...
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 So, we have this great theory of complex differentiability and
 holomorphic functions, and we can’t use any of it at all, because many
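
For reference, the Wirtinger derivatives that this calculus is built on are
conventionally defined, for :math:`z = x + iy`, as

.. math::

   \frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\frac{\partial f}{\partial y}\right),
   \qquad
   \frac{\partial f}{\partial \bar{z}} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\frac{\partial f}{\partial y}\right).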
