SciPy
SciPy is a package that provides a collection of mathematical algorithms and functions built on top of NumPy. In this lesson we will focus on statistical computations.
As always, first things first: let's import the packages we are going to use:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
This module contains a large number of probability distributions, both continuous and discrete, as well as a growing number of statistical functions.
# Import the whole module
import scipy.stats as st
# Documentation
np.info(st)
(output abridged) The `scipy.stats` docstring describes the module: each continuous distribution is an instance of `rv_continuous` (with methods such as `pdf`, `logpdf`, `cdf`, `ppf`, `sf`, `moment`, `stats`, `fit`, `expect`, ...) and each discrete one of `rv_discrete` (`pmf`, `cdf`, `ppf`, `rvs`, ...); calling an instance as a function returns a frozen distribution whose shape, location, and scale parameters are fixed. It then lists some 80 continuous distributions (`alpha`, `beta`, `cauchy`, `chi2`, `expon`, `gamma`, `lognorm`, `norm`, `t`, `uniform`, `weibull_min`, ...), the multivariate and discrete ones (`multivariate_normal`; `bernoulli`, `binom`, `geom`, `hypergeom`, `poisson`, `zipf`, ...), and the statistical functions: descriptive statistics (`describe`, `gmean`, `hmean`, `kurtosis`, `skew`, ...), binned statistics, correlation and regression (`pearsonr`, `spearmanr`, `kendalltau`, `linregress`, ...), hypothesis tests (`ttest_1samp`, `ttest_ind`, `kstest`, `chisquare`, `normaltest`, `shapiro`, `mannwhitneyu`, ...), contingency-table functions, and kernel density estimation (`gaussian_kde`).
Let's load some data, for example some course grades, and see how we can take advantage of the functions in `scipy.stats`.
# this line does not work on Windows
!head ../data/notas.csv
# Read the file
datos = np.loadtxt("../data/notas.csv", skiprows=1)
datos
array([ 2.9, 4.3, 3.9, 0. , 4.1, 7.3, 2.3, 5.6, 2.9, 3.9, 4.6, 6.3, 2.1, 2.1, 6.5, 1.9, 0. , 6.5, 2.5, 5.1, 5.3, 6.3, 5.4, 5.3, 5.3, 2. , 3.5, 4.4, 5.5, 3.6, 3.9, 2.5, 4.1, 3. , 4.6, 4. , 6.3, 0.6, 2.4, 6.5, 2.3, 4.6, 6.9, 5.1, 5.4, 5.3, 4.5, 6.5, 2.1, 5.5, 3.4, 8.1, 4. , 1.9, 1.6, 4.3, 4.6, 5.4, 1. , 6.5, 5.5, 4.9, 4. , 5.3, 3.5, 4.4, 2.8, 5.4, 3.5, 2.3, 4.8, 2.1, 6.6, 0.5, 2.1, 3.1, 3.4, 5.9, 3.4, 4.3, 1.5, 5.5, 4.4, 1.9, 4.4, 2.9, 3.9, 5.8, 2.8, 3. , 1.5, 2.6, 2.9, 3.4, 5.4, 3.6, 4.6, 5. , 1.4, 4.3, 4.6, 3.1, 2. , 3.6, 4. , 2.5, 3. , 5.1, 6.4, 3.5, 5.8, 4.1, 5.9, 4. , 6.4, 2.3, 7. , 1.4, 3.5, 4.4, 2.9, 5.1, 3.4, 4.8, 4.6, 4.3, 6.9, 5.4, 4. , 3.3, 1.4, 1.9, 3.8, 3.4, 3.6, 3.8, 6.3, 4.8, 4. , 6.8, 4. , 3.6, 4.4, 4.1, 6. , 4.1, 5.6, 3.9, 4.6, 5. , 6.5, 3.5, 5.5, 4.6, 4.8, 4.6, 6.5, 4.1, 4.4, 5.3, 3.6, 7.1, 4.6, 2.1, 3.3, 3.9, 4. , 4.4, 0.9, 4.3, 2.4, 2.9, 3.6, 1.4, 2.8, 2.5, 6.6, 0. , 5.1, 0. , 5.1, 4. , 2.6, 4.1, 4.6, 3.1, 4.4, 2.8, 2.8, 5.6, 3.9, 4.4, 4.1, 0.3, 2.4, 3.3, 2.5, 4.3, 2.5, 4.5, 4.8, 4.3, 3.3, 6. , 2.3, 4.5, 3.4, 4.5, 2.3, 4.5, 2.5, 6.4, 7. , 5.8, 3.4, 5.1, 4.9, 8.5, 3. , 3.3, 3.3, 2.6, 1.8, 2.9, 4.3, 2.1, 4.8, 5.3, 3.1, 5. , 4.6, 3.1, 5.6, 5.5, 4.3, 4.5, 5.4, 4.3, 1.5, 3.9, 5.4, 1.8, 3.5, 5.3, 2.6, 3.5, 3. , 4.9, 3.5, 3.4, 2. , 4.5, 5.3, 3. , 1.6, 4.8, 3.8, 0.6, 4.1, 5.4, 7.5, 0.6, 2.6, 2.1, 3.8, 3.3, 5.4, 3.6, 2.4, 4.6, 0.9, 5.8, 4.4, 1. , 3.6, 6.3, 4.4, 7.5, 5.9, 0.5, 4.3, 2.4, 6. , 4.6, 5.1, 6.1, 4. , 4.5, 7.9, 3.5, 3.1, 5. , 6.3, 4.9, 5.9, 4.6, 5.8, 5.4, 0.1, 1.8, 5.1, 4. , 2.4, 4.6, 4.9, 3.1, 1.4, 4. , 3.6, 4.3, 3.8, 4.4, 4.8, 5.1, 4. , 0. , 2.1, 5.9, 6.3, 3.1, 6. , 3.4, 1.9, 5.6, 5.3, 4.8, 2.6, 5.6, 4.8, 5.4, 3.4, 5.3, 4.1, 3.8, 3.6, 3.9, 2. , 3.5, 4. , 3.6, 0.6, 2.4, 3.9, 4.1, 2.8, 3. , 5.1, 5.4, 3.9, 3.3, 3.8, 2.5, 0.6, 2.8, 2.9, 7. , 6. , 2.8, 4. , 4.9, 4.8, 2.8, 2. , 4. , 2.6, 3.1, 2.9, 6.5, 4.3, 2.1, 3.9, 4.3, 0. , 7.4, 3.9])
# Quick summary of the data
st.describe(datos)
(375, (0.0, 8.5), 3.9706666666666668, 2.5927736185383243, -0.13203546994646295, -0.06611485627230884)
# Histogram with st
st.histogram(datos, numbins=10, defaultlimits=(0,10))
(array([ 17., 20., 61., 78., 98., 61., 29., 9., 2., 0.]), 0, 1.0, 0)
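Note that `st.histogram` only existed in older SciPy releases; it was deprecated and later removed. A minimal sketch of the modern equivalent with `np.histogram` (the `datos` array here is a synthetic stand-in for the grades, not the real file):

```python
import numpy as np

# Hypothetical stand-in for the grades loaded above
rng = np.random.default_rng(0)
datos = rng.uniform(0, 8.5, size=375)

# Modern equivalent of st.histogram(datos, numbins=10, defaultlimits=(0, 10))
counts, edges = np.histogram(datos, bins=10, range=(0, 10))
print(counts)   # occurrences in each of the 10 bins
print(edges)    # the 11 bin edges: 0, 1, ..., 10
```

`np.histogram` returns the counts together with the bin edges, which is also exactly what `plt.hist` computes internally.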
# Let's plot a histogram with plt
plt.hist(datos, range(0, 11))
plt.xticks(range(0,11))
plt.grid(True)
plt.vlines(5, 0, 100, lw=5, colors='red', alpha=0.8)
plt.fill_between([0, 5], [100, 100], color='red', alpha=0.5)
<matplotlib.collections.PolyCollection at 0x7f3f07891358>
# Let's plot a cumulative histogram with plt
plt.hist(datos, range(0,11), cumulative=True)
plt.xticks(range(0,11))
plt.vlines(5, 0, 400, lw=5, colors='red', alpha=0.8)
plt.fill_between([0, 5], [400, 400], color='red', alpha=0.5)
plt.grid(True)
# Percentile rank of a score
st.percentileofscore(datos, 5)
73.733333333333334
# Score at a given percentile
st.scoreatpercentile(datos, 50)
4.0
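These two functions are (approximately) inverses of each other, and `np.percentile` computes the same quantile. A small sketch with an illustrative sample, not the real grades:

```python
import numpy as np
import scipy.stats as st

# Illustrative sample
datos = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# scoreatpercentile(p) gives the value below which p% of the data fall;
# percentileofscore(x) goes in the opposite direction
mediana = st.scoreatpercentile(datos, 50)
print(mediana)                               # 5.5
print(st.percentileofscore(datos, mediana))  # 50.0

# np.percentile computes the same quantile
print(np.percentile(datos, 50))              # 5.5
```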
Do these grades look normal to you? No, I don't mean whether you like them or not... I mean: do you think these grades follow a Gaussian distribution?
# Parameters
med = st.nanmean(datos)
des_tip = st.nanstd(datos)
# Normal distribution (frozen, with fixed loc and scale)
dist_normal = st.norm(loc=med, scale=des_tip)
Now we can evaluate:
pdf
cdf
This way we save ourselves from having to define functions such as:
$$N(\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

$$\phi(x)=\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{t^2}{2}}\,dt$$

And to compute probabilities: $P[a\leq X\leq b]=\int_{a}^{b}f(x)\,dx$
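In practice that integral never has to be evaluated by hand: with a frozen distribution, $P[a\leq X\leq b]$ is just the difference of two `cdf` calls. A sketch, assuming the approximate parameters of the grades ($\mu\approx3.97$, $\sigma\approx1.61$):

```python
import scipy.stats as st

# Frozen normal with illustrative mean and standard deviation
dist = st.norm(loc=3.97, scale=1.61)

# P[a <= X <= b] = F(b) - F(a): no explicit integration needed
a, b = 0.0, 5.0
prob = dist.cdf(b) - dist.cdf(a)
print(prob)   # probability of a grade between 0 and 5 under this model
```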
# Compute the pdf
x = np.linspace(0, 10, 100)
y1 = dist_normal.pdf(x)
y1
array([ 1.18466112e-02, 1.38013272e-02, 1.60154276e-02, 1.85117381e-02, 2.13131100e-02, 2.44420387e-02, 2.79202305e-02, 3.17681221e-02, 3.60043569e-02, 4.06452262e-02, 4.57040834e-02, 5.11907430e-02, 5.71108776e-02, 6.34654270e-02, 7.02500358e-02, 7.74545375e-02, 8.50625018e-02, 9.30508640e-02, 1.01389654e-01, 1.10041839e-01, 1.18963303e-01, 1.28102958e-01, 1.37403018e-01, 1.46799426e-01, 1.56222440e-01, 1.65597374e-01, 1.74845495e-01, 1.83885049e-01, 1.92632412e-01, 2.01003342e-01, 2.08914298e-01, 2.16283816e-01, 2.23033887e-01, 2.29091330e-01, 2.34389108e-01, 2.38867558e-01, 2.42475512e-01, 2.45171267e-01, 2.46923389e-01, 2.47711323e-01, 2.47525793e-01, 2.46368984e-01, 2.44254502e-01, 2.41207102e-01, 2.37262214e-01, 2.32465247e-01, 2.26870729e-01, 2.20541268e-01, 2.13546393e-01, 2.05961280e-01, 1.97865419e-01, 1.89341226e-01, 1.80472670e-01, 1.71343910e-01, 1.62038000e-01, 1.52635673e-01, 1.43214236e-01, 1.33846588e-01, 1.24600387e-01, 1.15537363e-01, 1.06712791e-01, 9.81751290e-02, 8.99658014e-02, 8.21191408e-02, 7.46624629e-02, 6.76162683e-02, 6.09945536e-02, 5.48052151e-02, 4.90505281e-02, 4.37276824e-02, 3.88293574e-02, 3.43443198e-02, 3.02580270e-02, 2.65532243e-02, 2.32105212e-02, 2.02089381e-02, 1.75264140e-02, 1.51402697e-02, 1.30276213e-02, 1.11657426e-02, 9.53237395e-03, 8.10597949e-03, 6.86595419e-03, 5.79278319e-03, 4.86815747e-03, 4.07504992e-03, 3.39775625e-03, 2.82190558e-03, 2.33444533e-03, 1.92360505e-03, 1.57884349e-03, 1.29078296e-03, 1.05113462e-03, 8.52617863e-04, 6.88876683e-04, 5.54395268e-04, 4.44414802e-04, 3.54853006e-04, 2.82227567e-04, 2.23584322e-04])
# Plot it
plt.plot(x, y1)
plt.grid(True)
# Compute the cdf
y2 = dist_normal.cdf(x)
y2
array([ 0.00683286, 0.00812612, 0.00962974, 0.01137106, 0.01337976, 0.01568777, 0.0183293 , 0.02134067, 0.02476018, 0.02862791, 0.03298542, 0.03787547, 0.04334158, 0.04942765, 0.05617737, 0.06363372, 0.07183834, 0.08083088, 0.0906483 , 0.1013242 , 0.11288805, 0.12536454, 0.13877285, 0.15312598, 0.1684302 , 0.18468445, 0.20187994, 0.21999975, 0.2390186 , 0.25890272, 0.27960985, 0.30108939, 0.32328269, 0.34612342, 0.36953818, 0.39344715, 0.41776493, 0.44240138, 0.46726269, 0.49225242, 0.5172726 , 0.54222491, 0.56701185, 0.59153783, 0.61571033, 0.63944089, 0.66264614, 0.68524861, 0.70717752, 0.72836942, 0.74876866, 0.76832781, 0.78700782, 0.80477817, 0.82161679, 0.83750988, 0.85245164, 0.86644386, 0.87949545, 0.89162183, 0.90284437, 0.91318964, 0.92268879, 0.93137677, 0.93929165, 0.94647392, 0.95296581, 0.95881062, 0.96405221, 0.96873436, 0.97290038, 0.9765926 , 0.97985205, 0.98271818, 0.98522854, 0.98741866, 0.98932189, 0.99096933, 0.99238974, 0.99360961, 0.99465314, 0.99554231, 0.99629699, 0.99693499, 0.99747225, 0.99792288, 0.99829939, 0.99861271, 0.99887244, 0.9990869 , 0.99926327, 0.99940776, 0.99952567, 0.9996215 , 0.99969908, 0.99976165, 0.99981191, 0.99985212, 0.99988416, 0.9999096 ])
# Plot it
plt.plot(x, y2)
plt.grid(True)
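Besides `pdf` and `cdf`, a frozen distribution exposes other handy methods, such as `ppf` (the inverse of `cdf`) and `rvs` (random sampling). A minimal sketch with the same illustrative parameters:

```python
import scipy.stats as st

# Frozen normal (parameters are illustrative)
dist = st.norm(loc=3.97, scale=1.61)

print(dist.mean(), dist.std())  # recover loc and scale
print(dist.ppf(0.5))            # ppf inverts the cdf: the median, equal to loc
muestras = dist.rvs(size=5)     # draw random samples from the model
print(muestras)
```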
In the same way you can use other continuous or discrete distributions, and even define your own. But let's get back to the grades...
Now that we have visualised the distribution of the grades and we know how to generate normal distributions, why not run a Kolmogorov-Smirnov test?
The idea is to check how well (or how badly) the data fit a normal distribution with $\mu=3.97$ and $\sigma^2=2.59$.
bars = st.histogram(datos, numbins=10, defaultlimits=(0,10))[0]
bars /= 375
plt.bar(np.arange(0,10), bars, alpha=0.5, width=1)
plt.plot(x, y1, c='black', lw=2)
plt.grid(True)
plt.hist(datos, np.linspace(0,10,51), cumulative=True, alpha=0.5)
plt.plot(x, y2 * 375, lw=2, c='black')
plt.xticks(range(0,11))
plt.grid(True)
st.kstest(datos, dist_normal.cdf)
(0.04783071674813294, 0.34838712365988389)
The null hypothesis is rejected when the p-value associated with the observed result is less than or equal to the chosen significance level, conventionally 0.05 or 0.01. In other words, the p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. (Wikipedia)
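That decision rule can be sketched in a few lines. The sample below is synthetic, drawn from a normal distribution on purpose, so the test will typically not reject:

```python
import numpy as np
import scipy.stats as st

# Synthetic sample standing in for the grades
rng = np.random.default_rng(0)
muestra = rng.normal(loc=4.0, scale=1.6, size=375)
dist = st.norm(loc=muestra.mean(), scale=muestra.std())

# Decision rule: reject H0 when the p-value falls below alpha
estadistico, p_valor = st.kstest(muestra, dist.cdf)
alpha = 0.05
if p_valor <= alpha:
    print("Reject H0: the sample does not look normal")
else:
    print("Cannot reject H0: compatible with a normal distribution")
```

Strictly speaking, using parameters estimated from the same sample biases the KS p-value upwards (the Lilliefors correction addresses this), but the sketch mirrors what the notebook does above.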
If we try `st.normaltest`, which also checks goodness of fit, we get an even higher p-value:
st.normaltest(datos)
(1.1315306710610515, 0.56792532695583819)
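Another common check is the Shapiro-Wilk test, `st.shapiro`, which likewise returns a statistic and a p-value. A sketch on a synthetic normal sample:

```python
import numpy as np
import scipy.stats as st

# Synthetic normal sample (stand-in for the grades)
rng = np.random.default_rng(0)
muestra = rng.normal(loc=4.0, scale=1.6, size=375)

# Shapiro-Wilk normality test: again, a high p-value means
# we cannot reject normality
w, p = st.shapiro(muestra)
print(w, p)
```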
All in all, it seems that this time the grades did follow a normal distribution with $\mu=3.97$ and $\sigma^2=2.59$.
The following cells contain the Notebook configuration.
To display and use the Twitter links, the notebook must be run as trusted:
File > Trusted Notebook
# This cell styles the notebook
from IPython.core.display import HTML
css_file = '../styles/aeropython.css'
HTML(open(css_file, "r").read())