@@ -504,20 +504,152 @@ Links
504504 - `Scipy sparse matrix formats documentation <https://docs.scipy.org/doc/scipy/reference/sparse.html >`_
505505
506506Parallelism, resource management, and configuration
507- =====================================================
507+ ===================================================
508508
509509.. _parallelism :
510510
511- Parallel and distributed computing
512- -----------------------------------
511+ Parallelism
512+ -----------
513513
514- Scikit-learn uses the `joblib <https://joblib.readthedocs.io/en/latest/ >`__
515- library to enable parallel computing inside its estimators. See the
516- joblib documentation for the switches to control parallel computing.
514+ Some scikit-learn estimators and utilities can parallelize costly operations
515+ using multiple CPU cores, in the following ways:
516+
517+ - via the `joblib <https://joblib.readthedocs.io/en/latest/ >`_ library. In
518+ this case the number of threads or processes can be controlled with the
519+ ``n_jobs `` parameter.
520+ - via OpenMP, used in C or Cython code.
521+
522+ In addition, some of the NumPy routines that are used internally by
523+ scikit-learn may also be parallelized if NumPy is installed with specific
524+ numerical libraries such as MKL, OpenBLAS, or BLIS.
525+
526+ We describe these 3 scenarios in the following subsections.
527+
528+ Joblib-based parallelism
529+ ........................
530+
531+ When the underlying implementation uses joblib, the number of workers
532+ (threads or processes) that are spawned in parallel can be controlled via the
533+ ``n_jobs `` parameter.
534+
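For instance (the estimator and dataset below are purely illustrative), the
``n_jobs`` parameter can be passed both to an estimator whose fit is
parallelized with joblib and to a joblib-parallelized utility such as
cross-validation::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    # The trees of the forest are fitted in parallel by 2 joblib workers
    clf = RandomForestClassifier(n_estimators=100, n_jobs=2)

    # The cross-validation folds are also evaluated by 2 joblib workers
    scores = cross_val_score(clf, X, y, cv=5, n_jobs=2)
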
535+ .. note ::
536+
537+ Where (and how) parallelization happens in the estimators is currently
538+ poorly documented. Please help us by improving our docs and tackling `issue
539+ 14228 <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
540+
541+ Joblib is able to support both multi-processing and multi-threading. Whether
542+ joblib chooses to spawn a thread or a process depends on the **backend **
543+ that it's using.
544+
545+ Scikit-learn generally relies on the ``loky `` backend, which is joblib's
546+ default backend. Loky is a multi-processing backend. When doing
547+ multi-processing, in order to avoid duplicating the memory in each process
548+ (which isn't reasonable with big datasets), joblib will create a `memmap
549+ <https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html>`_
550+ that all processes can share when the data is larger than 1MB.
551+
552+ In some specific cases (when the code that is run in parallel releases the
553+ GIL), scikit-learn will indicate to ``joblib `` that a multi-threading
554+ backend is preferable.
555+
556+ As a user, you may control the backend that joblib will use (regardless of
557+ what scikit-learn recommends) by using a context manager::
558+
559+ from joblib import parallel_backend
560+
561+ with parallel_backend('threading', n_jobs=2):
562+ # Your scikit-learn code here
563+
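As a more concrete, purely illustrative sketch, the context manager can also
be used to force the process-based ``loky`` backend around existing
scikit-learn code::

    from joblib import parallel_backend
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    # Force the process-based 'loky' backend with 4 workers for this block
    with parallel_backend('loky', n_jobs=4):
        scores = cross_val_score(LogisticRegression(), X, y, cv=5, n_jobs=4)
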
564+ Please refer to `joblib's documentation
565+ <https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism>`_
566+ for more details.
567+
568+ In practice, whether parallelism is helpful in improving runtime depends on
569+ many factors. It is usually a good idea to experiment rather than assuming
570+ that increasing the number of workers is always a good thing. In some cases
571+ it can be highly detrimental to performance to run multiple copies of some
572+ estimators or functions in parallel (see oversubscription below).
573+
574+ OpenMP-based parallelism
575+ ........................
576+
577+ OpenMP is used to parallelize code written in Cython or C, relying on
578+ multi-threading exclusively. By default (and unless joblib is trying to
579+ avoid oversubscription), the implementation will use as many threads as
580+ possible, i.e. as many threads as there are logical cores.
581+
582+ You can control the exact number of threads that are used via the
583+ ``OMP_NUM_THREADS `` environment variable::
584+
585+ OMP_NUM_THREADS=4 python my_script.py
586+
587+ Parallel NumPy routines from numerical libraries
588+ ................................................
589+
590+ Scikit-learn relies heavily on NumPy and SciPy, which internally call
591+ multi-threaded linear algebra routines implemented in libraries such as MKL,
592+ OpenBLAS or BLIS.
593+
594+ The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set
595+ via the ``MKL_NUM_THREADS ``, ``OPENBLAS_NUM_THREADS ``, and
596+ ``BLIS_NUM_THREADS `` environment variables.
597+
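For instance, these variables can be exported in the shell before launching
Python, or (as an assumption about how the underlying BLAS libraries read
their configuration, not a scikit-learn API) set from Python itself, provided
this happens before NumPy and the BLAS implementation are loaded::

    import os

    # Which variable is honored depends on the BLAS library NumPy is
    # linked against (MKL, OpenBLAS, or BLIS).
    os.environ.setdefault("MKL_NUM_THREADS", "4")
    os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")
    os.environ.setdefault("BLIS_NUM_THREADS", "4")

    import numpy  # the limits above are picked up when the BLAS is loaded
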
598+ Please note that scikit-learn has no direct control over these
599+ implementations; it simply relies on NumPy and SciPy.
600+
601+ .. note ::
602+ At the time of writing (2019), NumPy and SciPy packages distributed on
603+ pypi.org (used by ``pip ``) and on the conda-forge channel are linked
604+ with OpenBLAS, while conda packages shipped on the "defaults" channel
605+ from anaconda.org are linked by default with MKL.
606+
607+
608+ Oversubscription: spawning too many threads
609+ ...........................................
610+
611+ It is generally recommended to avoid using significantly more processes or
612+ threads than the number of CPUs on a machine. Oversubscription happens when
613+ a program runs many more threads than there are CPUs available to execute them.
614+
615+ Suppose you have a machine with 8 CPUs. Consider a case where you're running
616+ a :class:`~sklearn.model_selection.GridSearchCV` (parallelized with joblib) with ``n_jobs=8`` over
617+ a :class:`~sklearn.ensemble.HistGradientBoostingClassifier` (parallelized with OpenMP). Each
618+ instance of :class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8 threads
619+ (since you have 8 CPUs). That's a total of ``8 * 8 = 64 `` threads, which
620+ leads to oversubscription of physical CPU resources and to scheduling
621+ overhead.
622+
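In code, that scenario would look roughly like the sketch below (the dataset
and parameter grid are only illustrative; on older scikit-learn versions the
experimental import ``from sklearn.experimental import enable_hist_gradient_boosting``
is needed first)::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)

    param_grid = {"learning_rate": [0.05, 0.1], "max_iter": [100, 200]}

    # 8 joblib workers, each fitting an OpenMP-parallelized estimator that
    # would, on its own, use all 8 CPUs: up to 64 threads in total.
    search = GridSearchCV(HistGradientBoostingClassifier(), param_grid, n_jobs=8)
    search.fit(X, y)
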
623+ Oversubscription can arise in the exact same fashion with parallelized
624+ routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
625+
626+ Starting from ``joblib >= 0.14 ``, when the ``loky `` backend is used (which
627+ is the default), joblib will tell its child **processes ** to limit the
628+ number of threads they can use, so as to avoid oversubscription. In practice
629+ the heuristic that joblib uses is to tell the processes to use ``max_threads
630+ = n_cpus // n_jobs ``, via their corresponding environment variable. Back to
631+ our example from above, since the joblib backend of :class:`~sklearn.model_selection.GridSearchCV`
632+ is ``loky ``, each process will only be able to use 1 thread instead of 8,
633+ thus mitigating the oversubscription issue.
634+
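As a rough illustration of that heuristic (the exact logic lives inside
joblib and may evolve), the per-worker thread budget in the example above
would be computed along these lines::

    from joblib import cpu_count

    n_jobs = 8
    # Each worker process is told to use at most n_cpus // n_jobs threads.
    # On the 8-CPU machine of the example: 8 // 8 = 1 thread per worker.
    max_threads_per_worker = max(cpu_count() // n_jobs, 1)
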
635+ Note that:
636+
637+ - Manually setting one of the environment variables (``OMP_NUM_THREADS ``,
638+ ``MKL_NUM_THREADS ``, ``OPENBLAS_NUM_THREADS ``, or ``BLIS_NUM_THREADS ``)
639+ will take precedence over what joblib tries to do. The total number of
640+ threads will be ``n_jobs * <LIB>_NUM_THREADS ``. Note that setting this
641+ limit will also impact your computations in the main process, which will
642+ only use ``<LIB>_NUM_THREADS ``. Joblib exposes a context manager for
643+ finer control over the number of threads in its workers (see joblib docs
644+ linked below).
645+ - Joblib is currently unable to avoid oversubscription in a
646+ multi-threading context. It can only do so with the ``loky `` backend
647+ (which spawns processes).
648+
649+ You will find additional details about joblib's mitigation of oversubscription
650+ in the `joblib documentation
651+ <https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-ressources>`_.
517652
518- Note that, by default, scikit-learn uses its embedded (vendored) version
519- of joblib. A configuration switch (documented below) controls this
520- behavior.
521653
522654Configuration switches
523655-----------------------